- The Blueprint
- Posts
- How Twitter creates sortable and unique IDs for 200B Tweets per year
How Twitter creates sortable and unique IDs for 200B Tweets per year
If you have used MongoDB extensively, you know that the IDs assigned to each document (similar to a row in SQL) are sortable. Let’s dive into the how and why!
Credit: MongoDB
Why does this even matter?
At a scale where you need a thousand new IDs per second, your choice of ID can make or break your system. Don’t believe me? Have you heard of…
Twitpocalypse!
In 2009, Twitter’s Tweet ID was a 32 bit long integer (2^32 or 4,294,967,296 combinations), set to be sequential (every subsequent request gets ID n+1) to make sorting easy.
With millions of new tweets per day in 2009, they were approaching their max tweet ID limit which caused a massive panic at TwitterHQ.
Credit: Ars Technica
A second chance
They did end up updating their IDs to 64-bit long sequential integer IDs. However, after facing some (unrelated) scale issues, they wanted to switch to Cassandra which has no built-in way of generating IDs. This gave them another shot at choosing a good ID format and this time, they learned from their past.
How Twitter solved for IDs?
Let’s first understand what a good ID generator is at scale. Twitter defines it as:
They considered Flickr’s ID generation approach (covered in a future article), UUIDs (128-bit IDs but Twitter needs 64-bit) and Zookeeper’s Sequence nodes (apparently not highly available).
They ended up building a custom solution called Snowflake. A service to create roughly-sorted 64-bit ids constructed using timestamp, worker ID and a random sequence number!
Credit: Atakan Demircioğlu on Medium
This format ended up being used by Instagram and even Twitter’s competitor, Mastodon (here is the exact code!).
The roughly sortable constraint is officially called k-sorting. For Twitter, the k = 1 second, which means tweets posted within a second of one another will be within a second of one another in the id space too!
MongoDB’s slightly different idea
Since I mentioned MongoDB before, I wanted to share their IDea :)
This information is right from their docs, their ID consists of:
4-byte timestamp, representing the ObjectId's creation, measured in seconds since the Unix epoch.
5-byte random value generated once per process. This random value is unique to the machine and process.
3-byte incrementing counter, initialized to a random value.
Credit: TechKranti
As a parting gift for this edition, I leave you with all the messages leading up to Twitter’s back-then new ID system! Until next time 👋