How Discord indexes 100B+ messages without breaking their bank

🫡 GM Busy Engineers. With message search being the most requested feature, Discord had a fun system design challenge on their hands and we love covering a good system design problem :)

Btw, Discord is hiring!

The Problem

Users wanted search. To make it happen, Discord laid out a few…

Requirements

  • Budget: Since search is a utility feature, Discord did not want to spend exorbitantly

  • Horizontally Scalable: Just like their Cassandra message store (up to a certain point), scaling should be as easy as adding new nodes

  • Lazily Indexed: Since not everyone uses search, messages should be indexed on a per-request basis (this also helps with costs)

Now, the easiest solution here is to use a managed service (AWS OpenSearch and the like) to handle indexing. But at Discord’s “billions of messages” scale, indexing costs would send expenses through the roof!

Enter Elasticsearch

Discord compared Elasticsearch (ES) and Solr and went with ES for system compatibility, ease of horizontal scaling (automatic shard rebalancing) and convenience (their engineers already knew ES).

The Little Big Detail

This kind of thoughtfulness is what makes Discord’s engineering talent so awesome!

“Elasticsearch likes it when documents are indexed in bulk.” This is great because bulk indexing cuts a tonne of per-document overhead, but it also means real-time indexing is a no-no → users can’t search for very recent messages.

However, Discord realized their users mostly search for historical texts and so a small delay between when a message was posted and when it became searchable was reasonable!

System Design

Ingestion

There are two forms of data to ingest: real-time and historical. For real-time data, Discord put a message queue in front of the index so new messages can be consumed asynchronously by a pool of indexing workers that bulk-insert them into Elasticsearch.
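
To make that flow concrete, here’s a minimal sketch of what a bulk-indexing worker could look like in Python with the official Elasticsearch client. This is my own illustration, not Discord’s code; the queue client, index name and document shape are all assumptions.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def index_batch(messages, index_name):
    """Bulk-insert a batch of queued messages into Elasticsearch in one request."""
    actions = (
        {
            "_index": index_name,
            "_id": msg["message_id"],  # Snowflake message ID
            "_source": {
                "content": msg["content"],
                "channel_id": msg["channel_id"],
                "server_id": msg["server_id"],
            },
        }
        for msg in messages
    )
    # One bulk call instead of one index request per message.
    helpers.bulk(es, actions)

def worker_loop(queue):
    # `queue` is a hypothetical message-queue client that hands back batches.
    while True:
        batch = queue.pop_batch(max_items=50)
        if batch:
            index_batch(batch, index_name="messages-17")  # index name is made up
```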

For historical data, they took the classic approach of using a task queue (their existing Celery infra): iterate over a server’s historical messages and create indexing tasks for them.
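
And a rough sketch of the historical side with Celery (task and helper names are mine, not Discord’s): one task pages through a server’s history and fans out smaller indexing jobs.

```python
from celery import Celery

app = Celery("indexer", broker="redis://localhost:6379/0")

@app.task
def index_server_history(server_id):
    """Walk a server's historical messages and create indexing tasks for them."""
    # fetch_message_batches is a hypothetical pager over the Cassandra message store.
    for batch in fetch_message_batches(server_id, batch_size=500):
        index_message_batch.delay(server_id, batch)

@app.task
def index_message_batch(server_id, batch):
    # Would bulk-insert the batch into the server's Elasticsearch shard,
    # as in the sketch above.
    ...
```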

“Sharding”

Source: Nidhi Gupta’s Medium - ElasticSearch Sharding

Discord wanted to avoid some common ES problems with very large clusters (though GitHub seems to manage fine), so they shard at the application layer (the Discord backend) instead of letting ES handle it (as in the diagram above).

Their application-level “shard” key is composed of the cluster key + index key (we’ll get to the index later), which locates where a Discord server’s messages are stored.

This shard-to-Discord-server mapping is persisted in Cassandra as the source of truth and cached in Redis for workers to access quickly.
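
As a tiny illustration (key names and the Cassandra read are assumptions, not Discord’s actual schema), a worker’s lookup could go through Redis first and fall back to Cassandra:

```python
import json
import redis

r = redis.Redis()

def get_shard(server_id):
    """Look up which cluster + index holds a server's messages."""
    cached = r.get(f"shard:{server_id}")
    if cached:
        return json.loads(cached)
    # Cache miss: read the source of truth and backfill the cache.
    mapping = fetch_shard_from_cassandra(server_id)  # hypothetical Cassandra read
    if mapping:
        r.set(f"shard:{server_id}", json.dumps(mapping))
    return mapping  # e.g. {"cluster": "es-cluster-3", "index": "messages-17"}
```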

We know how these shards work, but how are they allocated?

Redis Sorted Sets keep a set of items ordered by a score, which in this case is shard load. Every time a new mapping is made (a Discord server is allocated to a shard), that shard’s load score is incremented. Whenever a new server needs allocating, the shard with the lowest load is picked :)
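
Here’s roughly what that allocation looks like with redis-py (the sorted-set key and what counts as “load” are my assumptions):

```python
import redis

r = redis.Redis()
SHARD_LOADS = "shard_loads"  # sorted set: member = shard id, score = current load

def allocate_shard(server_id):
    """Pick the least-loaded shard for a newly searched server."""
    # Lowest score first, i.e. the least-loaded shard.
    shard, _load = r.zrange(SHARD_LOADS, 0, 0, withscores=True)[0]
    # Record the new allocation by bumping that shard's load.
    r.zincrby(SHARD_LOADS, 1, shard)
    # The server -> shard mapping would then be written to Cassandra and cached in Redis.
    return shard.decode()
```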

Implementation

Source: Discord Eng. Blog

While Discord goes into detail on their implementation, I’ll highlight the key deets that you subbed to this juicy newsie for.

Making search powerful

Indexing just the raw message only allows for basic text search. Discord is feature-rich, so they wrote a search index schema that also takes into account message metadata like file attachment names, mentions (user tags), etc. FYI - their current Search API is extremely powerful!

Source: Discord Eng. Blog - Example of metadata mapping
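
The actual mapping is in the blog post; as a purely illustrative sketch (index name, field names and types are mine, not Discord’s real schema), a metadata-aware mapping could look something like this with the Python Elasticsearch client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="messages-17",
    body={
        "mappings": {
            "properties": {
                "content":          {"type": "text"},     # the message text itself
                "attachment_names": {"type": "text"},     # searchable file names
                "mentions":         {"type": "keyword"},  # user IDs tagged in the message
                "author_id":        {"type": "keyword"},
                "channel_id":       {"type": "keyword"},
                "server_id":        {"type": "keyword"},
            }
        }
    },
)
```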

No timestamps are indexed

No timestamps are indexed… but aren’t they important for sorting and filtering? Not if you use Snowflake IDs :) (look at my short article on snowflakes for a quick debrief). Message IDs are Snowflakes, which already encode the creation time, so skipping a timestamp field and filtering at the application layer saves Elasticsearch disk space (saves money).
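
For the curious, pulling the creation time back out of a Snowflake is a one-liner (the Discord epoch below is from their public API docs):

```python
DISCORD_EPOCH_MS = 1420070400000  # 2015-01-01T00:00:00Z, the Discord epoch

def snowflake_to_unix_ms(snowflake_id: int) -> int:
    """The top bits of a Snowflake are milliseconds since the Discord epoch."""
    return (snowflake_id >> 22) + DISCORD_EPOCH_MS

# So time-range filters can be applied to the message IDs Elasticsearch returns,
# entirely at the application layer - no timestamp field needed in the index.
```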

Indexed vs Returned fields

ES uses an inverted index, which essentially maps content (terms) to the documents containing it. Although the index covers data like attachment names, the only fields that are returned (and thus stored) are the message ID, channel ID and server ID, to avoid duplicating data. This means the raw message has to be fetched from Cassandra, but it also means saving money on ES storage costs!
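
Put together, a search could look like this sketch: query ES, get back only the IDs, then hydrate the full messages from Cassandra (index/field names and the Cassandra fetch are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def search_messages(index_name, text):
    resp = es.search(
        index=index_name,
        body={
            "query": {"match": {"content": text}},
            "_source": ["message_id", "channel_id", "server_id"],  # only IDs come back
        },
    )
    ids = [hit["_source"] for hit in resp["hits"]["hits"]]
    # The raw messages (content, author, attachments, ...) live in Cassandra.
    return fetch_messages_from_cassandra(ids)  # hypothetical hydration step
```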

Indexing massive servers

We already learned a little bit about historical indexing, i.e. indexing a server’s existing messages. At an indexing rate of 500 messages per job, indexing massive servers would take donkey’s years. So the indexing is split into two phases: “initial” (high priority, indexes the most recent 7 days of messages) and “deep” (low priority, indexes the rest).
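
A sketch of how that split might be scheduled with Celery (queue names and task shapes are my assumptions, not Discord’s setup):

```python
import datetime
from celery import Celery

app = Celery("indexer", broker="redis://localhost:6379/0")

@app.task
def index_range(server_id, start_iso, end_iso):
    # Pages through messages in the given time range (~500 per job) and bulk-indexes them.
    ...

def schedule_historical_index(server_id):
    now = datetime.datetime.now(datetime.timezone.utc)
    week_ago = now - datetime.timedelta(days=7)
    # "initial" phase: the most recent week, on a high-priority queue.
    index_range.apply_async(args=(server_id, week_ago.isoformat(), now.isoformat()),
                            queue="index_initial")
    # "deep" phase: everything older, on a low-priority queue.
    index_range.apply_async(args=(server_id, None, week_ago.isoformat()),
                            queue="index_deep")
```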

Ending Notes

I’d recommend reading this part about infrastructure reliability metrics yourself (which I cut out in the interest of keeping this newsie short).

Source: Discord Eng. Blog - Healthy garbage collection run time

Discord hasn’t come out with an article on how they index trillions of messages, so it seems this implementation scaled quite well! I do wonder if they still lazily index messages the same way they did when they had billions of them.

Stay busy, cya next time!