GitHub dumps frustrating code search engine for Rust-powered Blackbird
Here's hoping for fewer head-desk moments for devs
GitHub's reworked Rust-based code search engine entered general availability on Monday, promising faster, more comprehensive explorations of software repositories.
The revision, dubbed Blackbird internally, has been three years in the making, and is part of the corporation's enduring effort to make text-based search techniques more effective on queries of computer code.
"Our goal with the new code search and code view is to enable developers to quickly search, navigate and understand their code, put critical information into context, and ultimately make them more productive," said Colin Merkel, a GitHub software engineer, in an announcement.
Founded in 2008, GitHub initially used Apache Solr to handle its code search. Then, after Solr was folded into Lucene, the collaborative code biz built a new search service using Elasticsearch in 2013. Outages followed and by 2020 – two years after Microsoft acquired the company – work on Blackbird commenced.
The goals for the project were: to index all source code on GitHub; to support incremental indexing and document deletion; to provide fast exact-match and regex queries (< 1 second for 95 percent of users on global queries, and faster still for more narrowly scoped queries); to integrate with GitHub code info; and to do so without expanding resource demands on GitHub's existing Elasticsearch cluster.
An off-the-shelf tool capable of the above did not exist, so GitHub committed to Blackbird, written in Rust, as the biz discussed in February. The resulting system can manage about 640 queries per second, compared to about 0.01 queries per second for ripgrep, thanks to precomputed search indices that map numeric keys to values and other architectural improvements. And it can index at a rate of roughly 120,000 documents per second.
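To see why a precomputed index beats scanning every file, consider this minimal sketch. It is not GitHub's implementation: the choice of three-character substrings (trigrams) as index keys, and the names trigrams, build_index, and search, are assumptions made for illustration. Each trigram maps to the set of documents containing it, so a query only has to verify the few documents that contain all of its trigrams.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings in text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(index, docs, query):
    """Return sorted IDs of documents containing query as a substring."""
    grams = trigrams(query)
    if not grams:
        # Query shorter than three characters: fall back to a full scan.
        candidates = set(docs)
    else:
        # Only documents holding every trigram of the query can match.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Verify candidates, since sharing trigrams does not guarantee a match.
    return sorted(d for d in candidates if query in docs[d])


# Hypothetical toy corpus standing in for indexed repositories.
docs = {
    1: "fn main() { println!(\"hello\"); }",
    2: "def main():\n    print('hello')",
    3: "resources:\n  limits:\n    memory: 512Mi",
}
index = build_index(docs)
```

Calling search(index, docs, "println") touches only the documents that share every trigram of the query rather than reading the whole corpus, which is the property that lets an indexed engine outpace a grep-style scan.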
"It is incredibly fast (about twice as fast as the old code search), far more capable (supporting substring queries, regular expressions, and symbol search), and understands code, putting the most relevant results first," explained Merkel.
Beyond the technical fiddling necessary to index and query 45 million repositories (which excludes many redundant forks), GitHub's new code search engine has been framed with search interface improvements that show suggestions and completions, and with a redesigned code view that brings search, browsing, and code navigation together.
The result is that finding specific text across repos works rather well. Take trying to locate values associated with the "memory" key in YAML configuration files for a Kubernetes cluster. GitHub code search makes it easy to focus only on YAML files.
That kind of precise filtering is also useful when trying to identify which particular part of an application produced a specific error message.
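For a feel of what such a scoped query does, here is a rough local analogue. It does not use GitHub's query syntax or API; the function name find_memory_values, the sample file paths, and the regular expression are all assumptions for illustration. It restricts the search to YAML files and then pulls out the value following each "memory" key.

```python
import re
from pathlib import PurePosixPath


def find_memory_values(files):
    """Return (path, value) pairs for 'memory:' keys found in YAML files only."""
    pattern = re.compile(r"^\s*memory:\s*(\S+)", re.MULTILINE)
    hits = []
    for path, text in files.items():
        # Scope the search: skip anything that is not a YAML file.
        if PurePosixPath(path).suffix not in {".yaml", ".yml"}:
            continue
        for match in pattern.finditer(text):
            hits.append((path, match.group(1)))
    return hits


# Hypothetical repository contents keyed by file path.
files = {
    "deploy/api.yaml": "resources:\n  limits:\n    memory: 512Mi\n  requests:\n    memory: 256Mi\n",
    "docs/notes.md": "memory: this line is not in a YAML file\n",
    "deploy/worker.yml": "resources:\n  limits:\n    memory: 1Gi\n",
}
results = find_memory_values(files)
```

The Markdown file is skipped even though it contains the same key, which is the point of filtering by file type before matching text.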
Merkel says that GitHub's goal with the new code search and code view is to help developers find important information scattered across their codebase, to contextualize that information, and to make developers more productive. ®