How Github's load balancer handles 5B+ requests / day
GM Busy Engineers. Today's topic deep dives into an integral part of most systems (and consequently system interviews): the Load Balancer (LB). Though it seems like a simple topic, there are many caveats and details that come into play when making LBs scale-ready. Check out the original article.
Also, Github's hiring senior engineers.
Source: Github Eng Blog
The Problem
Github serves billions of connections each day. In the past, Github approached distributing this insane load through vertical scaling… a.k.a. a few large machines running HAProxy (a proxy similar to NGINX).
Github's bare-metal load balancers were set up in a way that made horizontal scaling tough. This was a huge scale / maintainability bottleneck for a high-growth company like Github (back in 2016).
This leads us to their…
Ideal requirements
Some expectations were that the LB would…
- Run on commodity (widely available and cheap) hardware
- Scale horizontally and support high availability = still function if one machine goes down
- Support connection draining = stop sending new connections to a host while letting existing ones finish, so it can be taken down for maintenance
- Be resilient to typical DDoS and other attacks
The Solution
"Stretching the IP"
Source: Kinsta, every domain is mapped to one or more IPs
Usually in large multi-server applications, each server is assigned its own IP, and a domain name (like theblueprint.dev) is mapped to multiple IPs via DNS (round-robin DNS), which helps balance load across multiple servers!
Source: Apache, this demonstrates round-robin DNS in action
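To see round-robin DNS from the client side, you can resolve a domain and inspect how many A records come back. Here's a minimal sketch using Python's standard library (the domain and port are just examples):

```python
import socket

def resolve_all(domain: str) -> list[str]:
    """Return every IPv4 address a domain currently resolves to."""
    infos = socket.getaddrinfo(domain, 443, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    ips = []
    for *_, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in ips:  # dedupe, keeping resolver order
            ips.append(ip)
    return ips

# A round-robin-balanced domain returns multiple A records,
# and resolvers typically rotate their order between queries.
print(resolve_all("theblueprint.dev"))
```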
Github believes that DNS entries are cached by browsers and that the TTL (the time before a cached entry expires) is often ignored. This mis-caching leads to cases where a server fails but the domain name still resolves to that server's IP, so the user sees a non-meaningful "can't resolve" error.
Mozilla Firefox showing a "server not found" error
To solve this, Github engineers looked into ECMP (Equal-Cost Multi-Path) routing, which allows a single IP address to be served by multiple physical machines. By consistently hashing certain attributes of incoming packets (like the client IP), the router sends all packets from the same connection to one physical machine.
However, a challenge with this is when one server fails: the set of hosts changes, triggering a rehash event that can reroute existing connections onto machines holding no state for them, eventually terminating those connections.
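To make the rehash problem concrete, here's a toy sketch (the hash scheme and names are illustrative, not Github's implementation) of hash-modulo routing. Notice that removing one server remaps flows that were never on it:

```python
import hashlib

def pick_server(client_ip: str, servers: list[str]) -> str:
    # ECMP-style routing sketch: hash the client IP modulo the number
    # of servers, so one connection always lands on the same box.
    h = int.from_bytes(hashlib.sha256(client_ip.encode()).digest()[:8], "big")
    return servers[h % len(servers)]

servers = ["lb-1", "lb-2", "lb-3", "lb-4"]
clients = [f"203.0.113.{i}" for i in range(1, 17)]

before = {c: pick_server(c, servers) for c in clients}
servers.remove("lb-3")  # one machine dies
after = {c: pick_server(c, servers) for c in clients}

# Flows that changed servers lose their TCP state and get terminated --
# including many that were never on the failed lb-3.
moved = [c for c in clients if before[c] != after[c]]
print(f"{len(moved)} of {len(clients)} connections rerouted")
```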
So, Github Eng triedā¦
Splitting L4 and L7 tiers
L4 (Layer 4 of OSI) handles transport-level comms (like TCP), whereas L7 (Layer 7) handles application-level comms (like HTTP). In this solution, the L4 tier uses ECMP (discussed above) to divide traffic among multiple L4 load balancers.
Source: HAProxy
These L4 "director" hosts manage connection state and forward traffic to the L7 tier. The L7 tier, known as "proxy" hosts, runs software like HAProxy to handle connections and send them to backend servers.
The benefit of this split is that hosts can be drained gracefully without disrupting existing connections: since the directors track connection state, in-flight connections keep flowing to the proxy that owns them. This is helpful for maintenance and upgrades.
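As a rough sketch (hypothetical code, not Github's), the director's job boils down to a per-flow lookup table: existing flows keep their proxy even while it drains, and only new flows avoid draining proxies:

```python
class StatefulDirector:
    """Toy L4 director: pins each flow's 4-tuple to a proxy host."""

    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self.draining: set[str] = set()
        # This per-connection table is the state that makes draining easy --
        # and the thing a connection flood can bloat (the DDoS headache below).
        self.flows: dict[tuple, str] = {}

    def route(self, flow: tuple) -> str:
        if flow in self.flows:           # existing flow: keep its proxy,
            return self.flows[flow]      # even if that proxy is draining
        healthy = [p for p in self.proxies if p not in self.draining]
        if not healthy:
            raise RuntimeError("no proxies available")
        proxy = healthy[len(self.flows) % len(healthy)]  # naive round-robin
        self.flows[flow] = proxy
        return proxy

d = StatefulDirector(["proxy-1", "proxy-2"])
flow = ("203.0.113.5", 52144, "192.0.2.1", 443)  # (src ip, src port, dst ip, dst port)
assigned = d.route(flow)
d.draining.add(assigned)          # take that proxy out for maintenance
assert d.route(flow) == assigned  # the in-flight flow survives the drain
```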
A big drawback with this design is the added complexity in DDoS mitigation: the director tier must keep per-connection state, and a flood of bogus connections bloats exactly that state.
Designing a better director
Having a stateful director tier causes added complexity? Solution: make the director tier stateless.
With this change, Github ensures that when a user is downloading a large repo on a slow connection and some director or proxy nodes are removed for maintenance, the user will not lose their connection / download progress!
I won't go into the complexities of how this was done but you can check it out here.
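For a taste of how routing can work without shared state, one well-known technique is rendezvous (highest-random-weight) hashing: every director independently computes the same owner for a given flow, so there's no connection table to build or synchronize. A minimal sketch, my illustration rather than Github's actual GLB code:

```python
import hashlib

def rendezvous_pick(flow_key: str, proxies: list[str]) -> str:
    # Score every (flow, proxy) pair; the highest score wins. Any director
    # running this computes the identical answer with zero shared state,
    # and removing a proxy only remaps the flows that proxy owned.
    def score(proxy: str) -> int:
        raw = hashlib.sha256(f"{flow_key}|{proxy}".encode()).digest()
        return int.from_bytes(raw[:8], "big")
    return max(proxies, key=score)

proxies = ["proxy-1", "proxy-2", "proxy-3"]
flow = "203.0.113.5:52144->192.0.2.1:443"
owner = rendezvous_pick(flow, proxies)
survivors = [p for p in proxies if p != "proxy-2"]
# Unless proxy-2 was this flow's owner, its route is completely unchanged.
print(owner, rendezvous_pick(flow, survivors))
```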
Ending Notes
This article took a long time to synthesize and write given the highly technical and niche nature of it (networking). I would love some feedback on whether it was clear or how I can improve future highly technical pieces like this (reply or DM).
Goodbye and stay busy!