This article is more than 1 year old

In Search of Lost Time: GNU Grep 3.7 released with fix for 'extreme performance degradation'

Most searches were fine, but certain test cases now take 'seconds, not days'

GNU grep 3.7 has been released with a fix for a bug causing "extreme performance degradation" in certain types of search.

This search tool, which looks for character patterns in files, is a core utility on Linux and other Unix-like operating systems. In November last year, a user noted: "I have a use case where I run grep with a large number of search patterns on a large text file. It works well with grep-3.3, but with grep-3.4 it quickly burned through GBs of memory and almost locked up my system due to swapping ... even with just 30,000 patterns it exceeds the limit of 5GB."

By contrast, grep 3.3 used "just a few 100MB."

Grep maintainer Jim Meyering took a look and identified a bug that "triggers excessive hash collisions ... that made the new pattern-preprocessing phase take O(N^2) time for N patterns."

In the announcement for version 3.7, Meyering said that an example use (below) now takes "seconds, not days."

: | grep -Ff <(seq 6400000 | tr 0-9 A-J)

The bug only shows up in cases where "too many patterns hashed to too few buckets."

Grep is normally remarkably fast and a much better experience than, for example, the agony of attempting to search a large directory for a string in Windows Explorer.

In 2010, GNU grep author Mike Haertel posted about "why GNU grep is fast."

Haertel said that it avoids looking at every input byte by skipping ahead "whenever it finds a non-matching character," avoids copying data, and executes "fewer than 3 x86 instructions executed for each input byte it actually looks at."

The current documentation explains the algorithms used (Boyer-Moore and Aho-Corasick) and acknowledges that while normally efficient, "it can be quite slow in some cases."

Certain regular expressions can result in "using an algorithm that is exponential in the worst case." Case-insensitive searches and multi-byte characters can also slow it down.

Those in search of a faster grep may want to look at ripgrep, written in Rust primarily by Andrew Gallant, and which according to his tests is generally faster than grep, for reasons including always-on Unicode support.

Grep normally works well though and in the event of performance issues it might be worth checking out version 3.7. That said, it takes a while for new versions of grep to filter through to distros: the version in our Windows Subsystem for Linux, for example, is 2.25.

Hat-tip to Michael Larabel at Phoronix who spotted the fix.®

More about

TIP US OFF

Send us news


Other stories you might like