How GitHub's load balancer handles 5B+ requests / day

Omkaar Kamath
August 17, 2023

🫡 GM Busy Engineers. Today’s topic is a deep dive into an integral part of most systems (and consequently system design interviews): the Load Balancer (LB). It seems like a simple topic, but many caveats and details come into play when making LBs scale-ready. Check out the original article.

Also, GitHub’s hiring senior engineers.

Source: GitHub Eng Blog

The Problem

GitHub serves billions of connections each day. In the past, GitHub handled this insane load by scaling vertically… a.k.a. a few large machines running HAProxy (similar to NGINX).

GitHub’s bare-metal load balancers were set up in a way that made horizontal scaling tough. This is a huge scalability / maintainability bottleneck for a high-growth company like GitHub (back in 2016).

This leads us to their…

Ideal requirements

Some expectations were that the LB would…

Run on commodity (widely-available and cheap) hardware
Scale horizontally and support high availability = still function if one machine goes down
Support connection draining = stop sending new connections to a server and let existing ones finish, so it can be taken out for maintenance
Be resilient to typical DDoS and other attacks
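
To make the connection draining requirement concrete, here is a minimal sketch. The Backend and Pool classes are hypothetical names made up for this post, not anything from GitHub's code. The idea: a draining server gets no new connections, and it is only safe to remove once its existing connections have finished.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    draining: bool = False  # True once we want to take the server out
    active: int = 0         # connections currently open on this server


class Pool:
    """Hypothetical pool of servers sitting behind a load balancer."""

    def __init__(self, backends):
        self.backends = list(backends)

    def pick(self):
        # New connections only go to servers that are NOT draining.
        candidates = [b for b in self.backends if not b.draining]
        if not candidates:
            raise RuntimeError("no backend available")
        chosen = min(candidates, key=lambda b: b.active)
        chosen.active += 1
        return chosen

    def close(self, backend):
        # An existing connection finished on its own.
        backend.active -= 1

    def safe_to_remove(self, backend):
        return backend.draining and backend.active == 0


a, b = Backend("a"), Backend("b")
pool = Pool([a, b])
first = pool.pick()   # both servers idle, so the first one ("a") is chosen
a.draining = True     # start draining "a" for maintenance
second = pool.pick()  # new connections now avoid "a"
print(first.name, second.name, pool.safe_to_remove(a))
```

Note that the connection already open on "a" is left alone. The server only becomes safe to remove after that connection closes.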

The Solution

“Stretching the IP”

Source: Kinsta, every domain is mapped to one or more IPs

Usually in large multi-server applications, each server is assigned its own IP, and a domain name (like theblueprint.dev) is mapped to multiple IPs (round-robin DNS), which helps balance load across multiple servers!

Source: Apache, this demonstrates round-robin DNS in action
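
Here is a toy model of round-robin DNS, as a sketch only. The RoundRobinZone class is made up for illustration and the IPs come from the reserved documentation range. It assumes the name server rotates the record list on every query and that clients connect to the first IP in the answer.

```python
from collections import deque

# Hypothetical records; 192.0.2.0/24 is reserved for documentation.
RECORDS = {"theblueprint.dev": ["192.0.2.10", "192.0.2.11", "192.0.2.12"]}


class RoundRobinZone:
    """Toy name server that rotates the answer order on every query."""

    def __init__(self, records):
        self._records = {name: deque(ips) for name, ips in records.items()}

    def resolve(self, name):
        ips = self._records[name]
        answer = list(ips)
        ips.rotate(-1)  # the next query sees the list shifted by one
        return answer


zone = RoundRobinZone(RECORDS)
for _ in range(4):
    answer = zone.resolve("theblueprint.dev")
    print("client connects to", answer[0])
```

Each query gets the same set of IPs in a different order, so consecutive clients end up on different servers.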

GitHub’s concern is that DNS entries are cached by browsers and their TTL (time to live, i.e. how long a cached entry stays valid) is often ignored. This mis-caching leads to cases where a server fails but the domain name still resolves to that server’s IP, so the user sees a non-meaningful error page instead of the site.
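
The stale-cache problem can be sketched with a toy resolver. CachingResolver and the honor_ttl flag are hypothetical, and time is passed in as a plain number so the example runs instantly. One client respects the TTL, the other ignores it and keeps pointing at the dead server.

```python
class CachingResolver:
    """Toy client-side DNS cache (illustrative only)."""

    def __init__(self, lookup, honor_ttl=True):
        self.lookup = lookup        # function: name -> (ip, ttl_seconds)
        self.honor_ttl = honor_ttl  # False models a client that ignores TTLs
        self.cache = {}             # name -> (ip, expires_at)

    def resolve(self, name, now):
        hit = self.cache.get(name)
        if hit and (not self.honor_ttl or now < hit[1]):
            return hit[0]           # served from cache
        ip, ttl = self.lookup(name)
        self.cache[name] = (ip, now + ttl)
        return ip


authoritative = {"theblueprint.dev": ("192.0.2.10", 60)}  # (ip, ttl in seconds)


def lookup(name):
    return authoritative[name]


good = CachingResolver(lookup, honor_ttl=True)
bad = CachingResolver(lookup, honor_ttl=False)
good.resolve("theblueprint.dev", now=0)
bad.resolve("theblueprint.dev", now=0)

# The server at .10 fails and DNS is updated to point at .11.
authoritative["theblueprint.dev"] = ("192.0.2.11", 60)

fresh = good.resolve("theblueprint.dev", now=120)  # TTL expired, looks up again
stale = bad.resolve("theblueprint.dev", now=120)   # still the dead server
print(fresh, stale)
```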

Mozilla showing a non-resolvable error

To solve this, GitHub engineers looked into ECMP (Equal-Cost Multi-Path) routing. It allows a single IP address to be served by multiple physical machines. Using consistent hashing on certain attributes of incoming packets (such as the client IP), it sends all packets from the same connection to the same physical machine.

However, a challenge with this is when one server fails… this triggers a rehash event, and every active connection to that server eventually gets terminated.

So, GitHub Eng tried…
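
A rough sketch of the hashing idea is below. Real routers do this in hardware with their own hash functions. Here SHA-256, the 4-tuple fields, and the lb-N names are all assumptions made for illustration. The point is only that the same connection always hashes to the same machine.

```python
import hashlib


def flow_hash(src_ip, src_port, dst_ip, dst_port):
    # Hash the fields that identify a connection (an illustrative choice).
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def pick_next_hop(flow, next_hops):
    # Every packet of the same flow produces the same hash, so it
    # always lands on the same machine behind the shared IP.
    return next_hops[flow_hash(*flow) % len(next_hops)]


next_hops = ["lb-1", "lb-2", "lb-3", "lb-4"]  # machines sharing one IP
flow = ("203.0.113.7", 51234, "192.0.2.10", 443)

first_packet = pick_next_hop(flow, next_hops)
later_packet = pick_next_hop(flow, next_hops)
print(first_packet, later_packet)  # same machine both times
```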
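
To get a feel for what a rehash does, here is a small simulation. It assumes the simplest possible scheme, a plain hash modulo the number of machines, which is not necessarily what any given router does. Connections on the failed machine are lost no matter what. With plain modulo hashing, connections on healthy machines can get remapped too.

```python
import hashlib


def pick(flow_id, servers):
    # Plain "hash mod N" selection (an assumption made for illustration).
    digest = hashlib.sha256(flow_id.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["lb-1", "lb-2", "lb-3", "lb-4"]
flows = [f"203.0.113.{i % 250}:{40000 + i}" for i in range(1000)]

before = {f: pick(f, servers) for f in flows}

# lb-3 fails, so flows are rehashed across the three survivors.
survivors = [s for s in servers if s != "lb-3"]
after = {f: pick(f, survivors) for f in flows}

lost = [f for f in flows if before[f] == "lb-3"]
collateral = [f for f in flows if before[f] != "lb-3" and before[f] != after[f]]

print("flows that were on the failed machine:", len(lost))
print("flows moved off healthy machines:", len(collateral))
```

A moved flow shows up on a machine that has never seen that connection, which is why it ends up terminated.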

Splitting L4 and L7 tiers

L4 (Layer 4 of the OSI model) handles transport-level (like TCP) comms, whereas L7 (Layer 7) handles application-level comms (like HTTP). In this solution, the L4 tier uses ECMP (discussed above) to divide traffic among multiple L4 load balancers.

Source: HAProxy

These L4 "director" hosts manage connection state and forward traffic to the L7 tier. The L7 tier, made up of "proxy" hosts, uses software like HAProxy to handle connections and send them to backend servers.

The benefit of this split is that the L4 load balancers can be taken out gracefully without disrupting existing connections, as the connection state remains intact. This is helpful for maintenance and upgrades.
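
Why does keeping connection state help here? Below is a toy sketch. It assumes the directors share one connection table (how that state is actually shared is not covered here), and the Director class and proxy names are hypothetical. If one director is removed, another can look the flow up and keep sending it to the same proxy.

```python
from itertools import cycle


class Director:
    """Toy L4 director that remembers which proxy each flow was sent to."""

    def __init__(self, name, proxies, shared_state):
        self.name = name
        self._next_proxy = cycle(proxies)  # choice for brand-new flows
        self.state = shared_state          # flow -> proxy, shared by all directors (assumption)

    def forward(self, flow):
        if flow not in self.state:
            self.state[flow] = next(self._next_proxy)
        return self.state[flow]


state = {}
d1 = Director("d1", ["proxy-a", "proxy-b"], state)
d2 = Director("d2", ["proxy-b", "proxy-a"], state)

flow = ("203.0.113.7", 51234)
via_d1 = d1.forward(flow)  # d1 picks proxy-a and records it
# d1 is taken out for maintenance; packets for the flow now reach d2.
via_d2 = d2.forward(flow)  # d2 finds the flow in the table: still proxy-a
print(via_d1, via_d2)
```

Without the shared table, d2 would have made its own choice (proxy-b here) and the connection would have broken.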

A big drawback with this design is the added complexity to DDoS mitigation due to the director tier requiring connection state.

Designing a better director

Having a stateful director tier adds complexity? Solution: make the director tier stateless.

With this change, GitHub ensures that when a user is downloading a large repo on a slow connection and some director or proxy nodes are removed for maintenance, the user will not lose their connection / download progress!
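
For a taste of how a director can pick a proxy without keeping any per-connection state, here is a sketch of rendezvous (highest random weight) hashing. This is an illustration of the general idea only and not a description of GitHub's actual implementation. The function names and proxy names are made up. Every director computes the same answer from the flow alone, and removing a proxy only moves the flows that were on it.

```python
import hashlib


def score(flow, proxy):
    # Deterministic pseudo-random weight for a (flow, proxy) pair.
    digest = hashlib.sha256(f"{flow}|{proxy}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_proxy(flow, proxies):
    # No connection table: the highest-scoring proxy wins.
    return max(proxies, key=lambda p: score(flow, p))


proxies = ["proxy-a", "proxy-b", "proxy-c", "proxy-d"]
flows = [f"203.0.113.{i % 250}:{40000 + i}" for i in range(500)]

before = {f: pick_proxy(f, proxies) for f in flows}

# proxy-c is removed for maintenance.
remaining = [p for p in proxies if p != "proxy-c"]
after = {f: pick_proxy(f, remaining) for f in flows}

moved = [f for f in flows if before[f] != after[f]]
print("flows that changed proxy:", len(moved))
```

Every flow that changed proxy was previously on proxy-c. Flows on the other proxies keep their assignment, because their highest-scoring proxy is still in the list.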

I won’t go into the complexities of how this was done but you can check it out here.

Ending Notes

This article took a long time to synthesize and write given how highly technical and niche the topic (networking) is. I would love some feedback on whether it was clear and how I can improve future highly technical pieces like this (reply or DM).

Goodbye and stay busy!