
How GitHub's search scales over 300M repositories

and why their earlier solutions failed!

Greetings, fellow busy engineers! Have you tried out GitHub’s code search recently? The speed and performance are mind-boggling, especially given the number of repos it covers.

Searching “var” over 112k repos!

To understand why the pic above is a big deal, we need to take a look at the challenge they face.

Let’s talk numbers and constraints…

GitHub’s scale described in two stats

Here are the constraints they face:

  • Source code ≠ natural language, so search-optimization techniques like stemming (ex. “playing” → “play”) wouldn’t work (see the toy example after this list)

  • Newly merged code (termed “push events”) has to be searchable within minutes

  • 95th percentile search times should be < 1s
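
To see why stemming backfires on code, here is a toy Python sketch. The naive_stem function below is a deliberately crude stand-in made up for illustration, not anything GitHub actually runs:

# Toy illustration: why stemming (great for prose) hurts code search.
def naive_stem(token: str) -> str:
    """Strip common English suffixes from a token."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

prose = ["playing", "played", "plays"]
identifiers = ["read", "reads", "reading"]

print({t: naive_stem(t) for t in prose})
# {'playing': 'play', 'played': 'play', 'plays': 'play'}  <- exactly what prose search wants

print({t: naive_stem(t) for t in identifiers})
# {'read': 'read', 'reads': 'read', 'reading': 'read'}
# Three distinct identifiers collapse into one index term, so a search for the
# identifier `reading` would also surface every `read` and `reads` in the codebase.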

 
GitHub’s search has a fascinating history…

  1. Their first solution (pre-2010) used Solr instances that only covered public repos. Private repo search fell back to git grep, which scans every file in a repo on each query, an approach that made it painfully slow on large repos.

  2. The second solution used a large Elasticsearch cluster indexing 53B source files. By large, I mean 162 nodes, 5,184 vCPUs, and 40TB of RAM large. However, characters like $ & + ^ | ~ < > ( ) { } [ ] @ were ignored during indexing, which is no bueno for a code search!

  3. This brings us to the current solution: Project Blackbird!

 
Debriefing Operation Blackbird

Let’s ignore the FBI-ish name and peek under the hood of their custom code search engine. Keep in mind that the search engine has two components: Indexing and Querying. I cover the querying layer here but skip the (spicy but) lengthy indexing layer deets.

Goals for the ideal solution:

  • Index all source code on GitHub

  • Support incremental indexing and document deletion

  • Fast exact-match and regex searches (for superusers like us ;)

  • p95 query times < 1s

  • Roughly the same resource consumption as the old Elasticsearch cluster

How did they achieve these?

  • Shard by Git blob object ID: Spreads documents evenly across shards → better query performance (see the sketch after this list)

  • Delta indexing: Only unique content gets indexed, cutting redundancy in the input → smaller index and lower resource consumption. 115TB of content ⏩ 28TB of unique content!

  • Ingestion order based on a minimum spanning tree of a repository-similarity graph → lowers resource consumption during indexing

  • Sparse gram indices - A very complicated iterative algorithm to generate search indices!
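
Here is a minimal Python sketch of the first two ideas combined: route each document to a shard by its Git blob object ID, and only index content we haven’t seen before. The OID computation mirrors how Git hashes blobs; the shard count and the in-memory seen_oids set are made-up stand-ins for illustration:

import hashlib

NUM_SHARDS = 32               # made-up shard count for illustration
seen_oids: set[str] = set()   # stand-in for the "have we indexed this content?" check

def blob_oid(content: bytes) -> str:
    """Git-style blob object ID: sha1 over a 'blob <len>\\0' header plus the content."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

def route(content: bytes) -> tuple[int, bool]:
    """Return (shard, should_index). Identical content always lands on the same
    shard, and is only indexed the first time we see it (the delta-indexing idea)."""
    oid = blob_oid(content)
    shard = int(oid, 16) % NUM_SHARDS   # hashes spread evenly across shards
    should_index = oid not in seen_oids
    seen_oids.add(oid)
    return shard, should_index

# A file and its byte-identical copy in a fork: same OID -> same shard, indexed once.
print(route(b"def hello():\n    return 'world'\n"))   # (some shard, True)
print(route(b"def hello():\n    return 'world'\n"))   # (same shard, False)

Sharding by content hash rather than by repo keeps shards balanced even when one repo is enormous, and because identical blobs share an OID, forked or copied files that share content only cost one index entry.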


How do they handle a search query?

Architectural Diagram

Let’s break down this diagram:

  • Client: Sends out the API call with search query

  • Blackbird Query Service: Parses the query into an Abstract Syntax Tree (AST) (example below) to resolve languages, permissions, and scopes. (Reply to this email if you want a future newsletter covering how ASTs work!) The AST is fanned out to every shard of the search engine cluster. Finally, the results from all shards are aggregated, re-sorted by score, and filtered for permissions, and the top 100 are returned (see the fan-out sketch after this list).

And(
    Owner("rails"),
    LanguageID(326),
    Regex("arguments?"),
    Or(
        RepoIDs(...),
        PublicRepo(),
    ),
)
  • Single Search Engine shard: Further breaks down the query for search on that specific shard and returns relevant results.

  • Redis: Caches repo permissions and stores rate limiting quotas

  • Git: Validates blob accessibility
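
To make the query-service step concrete, here is a minimal fan-out sketch in Python. The shard stub, scoring, and permission check are simplified stand-ins for illustration, not GitHub’s actual API:

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Result:
    repo_id: int
    path: str
    score: float

def search_shard(shard_id: int, ast) -> list[Result]:
    """Stand-in for one shard evaluating the query AST against its local index."""
    # A real shard would walk its sparse-gram index here; this stub returns nothing.
    return []

def fan_out(ast, shards: range, allowed_repo_ids: set[int], top_k: int = 100) -> list[Result]:
    """Send the parsed AST to every shard, then aggregate, re-sort, and filter."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = pool.map(lambda shard: search_shard(shard, ast), shards)

    merged = [r for results in per_shard for r in results]
    merged.sort(key=lambda r: r.score, reverse=True)                 # re-sort by score
    visible = [r for r in merged if r.repo_id in allowed_repo_ids]   # permission filter
    return visible[:top_k]                                           # top results back to the client

In the real system, the permission data behind allowed_repo_ids would come from the repo-permission cache in Redis mentioned above.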

Check out the full article here!

Thank you for your time! Since this is our first edition, I would love feedback and your thoughts on how we can improve (simply reply to this email)!

P.S. if you received any value from this, we would really appreciate a social share (please tag us)!!