
How Uber improved Cassandra reliability by 1000%

🫡 GM Busy Engineers. This blog covers the challenges Uber faced while self-managing an ungodly amount of Cassandra nodes. The sheer scale at which Cassandra is needed and used is always host to interesting problems.


Who is Cassandra??

Source: Apache Cassandra Docs

I’d recommend reading my previous article on Discord using Cassandra as a primer.

Cassandra is a distributed NoSQL database built to handle massive amounts of data while ensuring high availability (minimal downtime) and fault tolerance (if one node dies, the DB still works).

It has fast reads and even faster writes and can be easily scaled by adding more servers (horizontal scaling), making it suitable for scenarios where traditional relational databases might struggle with the volume of data and traffic.
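To make that concrete, here is a minimal sketch using the DataStax Python driver (cassandra-driver). The keyspace, table, and contact point are made up for illustration: a replication factor of 3 keeps three copies of every row, and a QUORUM write only needs 2 of those 3 replicas to answer, which is where the availability and fault tolerance come from.

```python
# Minimal sketch with the DataStax Python driver -- keyspace, table, and
# contact point are hypothetical, not Uber's setup.
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])  # contact point for the cluster
session = cluster.connect()

# Replication factor 3: every row lives on 3 nodes, so one dead node
# doesn't take your data down with it. (SimpleStrategy is fine for a demo;
# production clusters use NetworkTopologyStrategy.)
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS rides
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS rides.trips (trip_id uuid PRIMARY KEY, driver text)
""")

# QUORUM = 2 of 3 replicas must acknowledge the write.
insert = SimpleStatement(
    "INSERT INTO rides.trips (trip_id, driver) VALUES (uuid(), %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, ["alice"])
```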

Cassandra at Uber

Source: Business Insider

Usually, when a company uses Cassandra, they use a managed offering (e.g., GCP-hosted Cassandra) where the cloud provider runs the DB cluster and lets you store your data on it… making reliability and maintenance a worry of yesterday.

However, at Uber’s scale, it is probably cheaper and easier to run Cassandra as a self-managed service where Uber hosts it and hires a team of engineers to manage it (+ higher customization to Uber’s problems is possible).

Source: Uber Blog - Uber’s Cassandra Deployment

I have orchestrated Postgres-as-a-self-managed-service before and the reduction in cost (30%ish) + gain in performance is awesome BUT it also introduces a lot more overhead and headaches 🤦‍♂️

The Problem

When something runs at the scale of…

  • 10s of millions of queries per second on Petabytes of data (1 PB = 10^6 GB)

  • 10s of 1000s of Cassandra nodes with 1000s of unique keyspaces (databases)

  • 100s of unique clusters across multiple regions

… it is bound to face reliability and maintenance issues. Even Discord could not handle it and eventually moved off Cassandra.

Uber goes into a lot of detail about three different ways they improved Cassandra's reliability, but I will cover only my favorite (based on elegance) and encourage you to read the full article.

Sluggish anti-entropy

It is a known fact that entropy always increases; Cassandra thinks otherwise!

Let me explain… Anti-entropy = reducing chaos and ensuring data consistency across all Cassandra nodes. Uber identified their top causes of data inconsistencies:

  1. Frequent data deletions - Cassandra handles deletions by adding a special tombstone marker to the data to indicate it's garbage and later runs compaction processes that purge all tombstoned data from all nodes. Any failures in these processes can cause data inconsistencies (the stored data ends up differing across replicas)

  2. Downed nodes - My guess here is that downed nodes come back up and are not synced correctly due to failures in the hint file mechanism (hint files record all updates made to data while a node is down). The toy sketch below shows how a downed replica ends up disagreeing with the others.
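Here is a toy model of the first failure mode (nothing like Cassandra's real storage engine, just an illustration): a delete is written as a tombstone, compaction later purges the tombstoned data, and a replica that was down when the delete happened keeps serving the "deleted" row.

```python
# Toy model of tombstone-based deletion across replicas -- purely
# illustrative, not Cassandra's actual storage engine.
import time

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}        # key -> value
        self.tombstones = {}  # key -> deletion timestamp

    def write(self, key, value):
        self.data[key] = value

    def delete(self, key):
        # A delete is just another write: a tombstone marker.
        # (Real Cassandra keeps tombstones around for gc_grace_seconds.)
        self.tombstones[key] = time.time()

    def compact(self):
        # Compaction purges tombstoned data for good.
        for key in list(self.tombstones):
            self.data.pop(key, None)
            del self.tombstones[key]

    def read(self, key):
        if key in self.tombstones:
            return None
        return self.data.get(key)

r1, r2, r3 = Replica("r1"), Replica("r2"), Replica("r3")
for r in (r1, r2, r3):
    r.write("trip:42", "cancelled")

# r3 is down when the delete arrives, so it never sees the tombstone.
r1.delete("trip:42")
r2.delete("trip:42")

r1.compact()
r2.compact()

print(r1.read("trip:42"), r2.read("trip:42"), r3.read("trip:42")))
# -> None None cancelled ... the replicas disagree until something repairs them
```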

These can be fixed using a repair mechanism - similar to something that resolves git merge conflicts automagically.

While there are external utilities that can help schedule repairs for these inconsistencies, Uber embraced the more elegant solution: adding this within Cassandra itself.

The solution? Implementing a repair scheduler on a dedicated thread pool within Cassandra. A repair compares data across replicas and syncs any differences so they converge back to the same state. This scheduler maintains a special table in the system_distributed keyspace to track repair history for all nodes. It records when each node was last repaired, among other details.

The scheduler selects the nodes that were repaired the longest ago and coordinates the process to ensure all tables are repaired. Apparently, this is so reliable that it requires no manual intervention (the ultimate goal of any SRE)!
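Here is a rough sketch of that scheduling loop as I read it from Uber's post. The table and function names are hypothetical, and Uber's version runs inside Cassandra on a dedicated thread pool rather than as an external script:

```python
# Rough sketch of the repair-scheduling loop. Names are hypothetical; the
# real thing lives inside Cassandra itself on a dedicated thread pool.
import time
from datetime import datetime, timezone

# Stand-in for the repair-history table in the system_distributed keyspace:
# node -> timestamp of its last completed repair.
repair_history = {
    "node-a": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "node-b": datetime(2024, 1, 3, tzinfo=timezone.utc),
    "node-c": datetime(2024, 1, 2, tzinfo=timezone.utc),
}

def run_repair(node: str) -> None:
    """Placeholder for coordinating a full repair of every table on `node`."""
    print(f"repairing {node} ...")
    time.sleep(0.1)  # pretend the replica comparison and syncing happens here

def repair_loop(rounds: int) -> None:
    for _ in range(rounds):
        # Pick the node that was repaired the longest ago.
        node = min(repair_history, key=repair_history.get)
        run_repair(node)
        # Record completion so the next pick moves on to another node.
        repair_history[node] = datetime.now(timezone.utc)

repair_loop(rounds=3)  # repairs node-a, then node-c, then node-b
```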

Thank you for reading this edition of theblueprint.dev. Stay busy and have a good one!