[フレーム]
BT

InfoQ Software Architects' Newsletter

A monthly overview of things you need to know as an architect or aspiring architect.

View an example

We protect your privacy.

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Unlock the full InfoQ experience

Unlock the full InfoQ experience by logging in! Stay updated with your favorite authors and topics, engage with content, and download exclusive resources.

Log In
or

Don't have an InfoQ account?

Register
  • Stay updated on topics and peers that matter to youReceive instant alerts on the latest insights and trends.
  • Quickly access free resources for continuous learningMinibooks, videos with transcripts, and training materials.
  • Save articles and read at anytimeBookmark articles to read whenever youre ready.

Topics

Choose your language

InfoQ Homepage News Discord Migrates Trillions of Messages from Cassandra to ScyllaDB

Discord Migrates Trillions of Messages from Cassandra to ScyllaDB

This item in japanese

Jun 22, 2023 2 min read

Write for InfoQ

Feed your curiosity. Help 550k+ global
senior developers
each month stay ahead.
Get in touch

Discord has migrated trillions of message records from Apache Cassandra to ScyllaDB, reducing the size of the largest cluster from 177 Cassandra nodes to 72 ScyllaDB nodes and reducing tail latencies for reads and writes. The move has unlocked new product use cases because of the improved database stability and performance.

As Discord grew, it migrated its data from MongoDB to Cassandra in 2017 because it was looking for a scalable database to handle ever-growing data volumes. Initially, the Cassandra cluster consisted of 12 nodes and stored billions of messages. Still, after five years, the cluster had 177 nodes and was frequently experiencing performance problems, forcing the team to reduce some of the maintenance operations, which became too expensive to run.

Some performance issues were caused by hot partitions resulting from the table schema design, where partitioning was based on the Discord channel and time bucket. Bo Ingram, a senior software engineer at Discord, explains the impact of hot partitions on the database cluster:

When we encountered a hot partition, it frequently affected latency across our entire database cluster. One channel and bucket pair received a large amount of traffic, and latency in the node would increase as the node tried harder and harder to serve traffic and fell further and further behind. [...] Since we perform reads and writes with a quorum consistency level, all queries to the nodes that serve the hot partition suffer latency increases, resulting in a broader end-user impact.

Based on the experimenting and testing done internally, the team has decided to move its data across all clusters to ScyllaDB. They opted for ScyllaDB primarily to improve performance, including avoiding garbage-collection-related issues they experienced with Cassandra. They also worked with the ScyllaDB team to improve some use cases they depended on, like reverse queries.

After migrating all smaller clusters by 2020, the team prepared to migrate the biggest cluster, containing trillions of messages. To minimize the hot partition problem, they created a new intermediary service layer in their architecture, named data services, written in Rust and interfaced via gRPC API.

One important responsibility of data services is request coalescing, which avoids multiple database calls when many users request the same message. Secondly, the team implemented consistent hash-based routing to data service instances based on a routing key, such as a channel id. Together, these changes significantly reduced hot partition problems, giving Discord extra time to prepare for the big migration.

Source: https://discord.com/blog/how-discord-stores-trillions-of-messages

For the migration itself, the team first considered using ScyllaDB’s Apache Spark migrator but, in the end, decided to implement a bespoke solution in Rust with SQLite used for checkpointing, which allowed them to shorten the migration time from three months to nine days. After addressing some minor hiccups, they validated the completed migration and switched to ScyllaDB in May 2022. Since then, the new cluster has proven stable and provided consistent performance, which, together with the data service layer, allowed it to handle extra traffic generated by the World Cup gracefully.

About the Author

Rafal Gancarz

Show moreShow less

Rate this Article

Adoption
Style

Related Content

The InfoQ Newsletter

A round-up of last week’s content on InfoQ sent out every Tuesday. Join a community of over 250,000 senior developers. View an example

We protect your privacy.

BT

AltStyle によって変換されたページ (->オリジナル) /