GitHub claims source code search engine is a game changer
When grep isn't good enough, try Blackbird
GitHub has a lot of code to search – more than 200 million repositories – and says the beta version of a search engine optimized for source code, launched last November, has caused a "flurry of innovation."
GitHub engineer Timothy Clem explained that the company has had problems getting existing technology to work well. "The truth is from Solr to Elasticsearch, we haven't had a lot of luck using general text search products to power code search," he said in a GitHub Universe video presentation. "The user experience is poor. It's very, very expensive to host and it's slow to index."
In a blog post on Monday, Clem delved into the technology used to scour just a quarter of those repos, a code search engine built in Rust called Blackbird.
Blackbird currently provides access to almost 45 million GitHub repositories, which together amount to 115TB of code and 15.5 billion documents. Sifting through that many lines of code requires something stronger than grep, a common command line tool on Unix-like systems for searching through text data.
Using ripgrep on an 8-core Intel CPU to run an exhaustive regular expression query on a 13GB file in memory, Clem explained, takes about 2.769 seconds, or 0.6GB/sec/core.
"We can see pretty quickly that this really isn't going to work for the larger amount of data we have," he said. "Code search runs on 64-core, 32-machine clusters. Even if we managed to put 115TB of code in memory and assume we can perfectly parallelize the work, we're going to saturate 2,048 CPU cores for 96 seconds to serve a single query! Only that one query can run. Everybody else has to get in line."
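Clem's 96-second figure follows directly from the quoted throughput numbers. A quick back-of-envelope check (assuming decimal GB/TB, as the quoted 0.6GB/sec/core figure implies):

```python
# Back-of-envelope check of the figures quoted above.
corpus_gb = 115 * 1000            # 115TB of code
cores = 64 * 32                   # 64-core machines, 32-machine cluster = 2,048 cores
per_core_gbps = 13 / 2.769 / 8    # 13GB scanned in 2.769s across 8 cores (~0.6GB/sec/core)

seconds = corpus_gb / (per_core_gbps * cores)
print(round(seconds))             # ~96 seconds to exhaustively scan the corpus once
```

One exhaustive scan per query, at roughly a minute and a half each, is where the 0.01 queries per second figure below comes from.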
At 0.01 queries per second, grep was not an option. So GitHub front-loaded much of the work into precomputed search indices. These are essentially maps of key-value pairs. This approach makes it less computationally demanding to search for document characteristics like the programming language or word sequences by using a numeric key rather than a text string.
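The key-value idea can be illustrated with a minimal sketch (the keys, tokens, and document contents here are hypothetical, not Blackbird's actual schema): each document characteristic becomes a key pointing at the list of document IDs that have it, so a query is a lookup rather than a scan.

```python
from collections import defaultdict

# Minimal sketch of a precomputed key-value search index: each key
# (a language tag or a token) maps to the list of document IDs that
# match it, so answering a query means looking up a key, not
# scanning every document's text.
docs = {
    0: ("rust", "fn main"),
    1: ("python", "def main"),
    2: ("rust", "fn search"),
}

index = defaultdict(list)
for doc_id, (lang, content) in docs.items():
    index[("lang", lang)].append(doc_id)
    for token in content.split():
        index[("token", token)].append(doc_id)

print(index[("lang", "rust")])    # [0, 2]
print(index[("token", "main")])   # [0, 1]
```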
Even so, these indices are too large to fit in memory, so GitHub built iterators for each index it needed to access. According to Clem, these lazily return sorted document IDs that represent the rank of the associated document and meet the query criteria.
To keep the search index manageable, GitHub relies on sharding – breaking the data up into multiple pieces using Git's content-addressable hashing scheme – and on delta encoding – storing data differences (deltas) to reduce the data and metadata to be crawled. This works well because GitHub has a lot of redundant data (e.g. forks): its 115TB of data can be boiled down to 25TB through deduplication.
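Content-addressable storage is what makes that deduplication nearly free: identical content always hashes to the same key, so a file shared by thousands of forks is stored and indexed once. A minimal sketch of the idea (the blobs here are made up for illustration):

```python
import hashlib

# Sketch of content-addressable deduplication: identical blobs hash
# to the same key, so storing a duplicate is a no-op. Redundant data
# across forks collapses into a single stored object.
blobs = [b"fn main() {}", b"fn main() {}", b"fn search() {}"]

store = {}
for blob in blobs:
    key = hashlib.sha256(blob).hexdigest()
    store.setdefault(key, blob)   # a second identical blob changes nothing

print(len(blobs), "blobs ->", len(store), "unique objects")
```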
The resulting system works much faster than grep – 640 queries per second compared to 0.01 queries per second. And indexing occurs at a rate of about 120,000 documents per second, so processing 15.5 billion documents takes about 36 hours, or 18 for re-indexing since delta (change) indexing reduces the number of documents to be crawled.
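Those indexing figures check out arithmetically:

```python
# Checking the indexing arithmetic quoted above.
docs = 15.5e9             # 15.5 billion documents
rate = 120_000            # documents indexed per second
hours = docs / rate / 3600
print(round(hours))       # ~36 hours for a full index build
```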
GitHub Code Search is presently in beta testing. ®