Anubis guards gates against hordes of LLM bot crawlers

Using proof of work to block the web-crawlers of 'AI' companies

Updated Anubis is a sort of CAPTCHA test, but flipped: instead of checking that visitors are human, it aims to make web crawling prohibitively expensive for the companies feeding their hungry LLM bots.

It's a clever response to a growing problem: the ever-expanding list of companies that want to sell "AI" bots powered by Large Language Models (LLMs). LLMs are built from a "corpus," a very large database of human-written text. To keep updating the model, an LLM bot-herder needs fresh text for their "corpus."

Anubis is named after the ancient Egyptian jackal-headed god who weighed the hearts of the dead to determine their fitness. To protect websites from AI crawlers, the Anubis software instead weighs visitors' willingness to do some computation, in what is called a proof-of-work challenge.

A human visitor merely sees a jackal-styled anime girl for a moment while their browser solves a cryptographic puzzle. For companies running large-scale bot farms, though, that means the expensive sound of a whole datacenter's fans spinning up to full power. In theory, when scanning a site is that costly, the spider backs off.
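The puzzle shape is the classic one from Hashcash: find a nonce whose hash, combined with a server-issued challenge, starts with a given number of zeroes. This is a minimal sketch of that idea, not Anubis's actual implementation (the challenge string, difficulty, and function names here are illustrative):

```python
import hashlib
import itertools

def solve_pow(challenge: str, difficulty: int) -> int:
    """Find a nonce so that sha256(challenge + nonce) begins with
    `difficulty` hex zeroes -- the same puzzle shape Hashcash used."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

# Solving takes roughly 16**difficulty hash attempts on average;
# verifying the answer takes exactly one.
nonce = solve_pow("example-challenge", 4)
proof = hashlib.sha256(f"example-challenge{nonce}".encode()).hexdigest()
```

The asymmetry is the whole trick: one page view is a fraction of a second of CPU time, but a crawler hammering millions of pages pays that cost millions of times over.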

There are existing measures to stop search engines crawling your site, such as a robots.txt file. But as Google's explanation says, just having a robots.txt file doesn't prevent a web spider crawling through the site. It's an honor system, and that's a weakness. If the organization running the scraper chooses not to honor it – or your intellectual property rights – then they can simply take whatever they want, as often as they want.
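To illustrate the honor system: a site can ask specific crawlers to stay away with entries like the following (GPTBot and ClaudeBot are real crawler user-agent tokens, but nothing obliges their operators to comply):

```
# robots.txt -- a polite request, not an enforcement mechanism
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```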

Repeat visits are a big problem. It's cheaper to repeatedly scrape largely identical material than it is to store local copies of it — or as Drew DeVault put it, please stop externalizing your costs directly into my face.

It was already a serious problem a year ago, when The Register reported on ClaudeBot crawling a million times in one day. A year later, and despite signing deals, Reddit sued Anthropic over it. It doesn't just affect forums and the like: LWN is facing the problem. Tech manual publishing tool ReadTheDocs reported one crawler downloading 73 terabytes in a month.


The underlying technology is not new. The idea of proof-of-work as an anti-spam measure goes back to Hashcash in 1997, to which The Reg referred back in 2013. In a Hacker News comment, Iaso also gave due credit:

I was inspired by Hashcash, which was proof of work for email to disincentivize spam. To my horror, it worked sufficiently for my git server so I released it as open source. It's now its own project and protects big sites like GNOME's GitLab.

Other comments detail how the proof of work is done, and we appreciated this note:

The second reason is that the combination of Chrome/Firefox/Safari's JIT and webcrypto being native C++ is probably faster than what I could write myself. Amusingly, supporting this means it works on very old/anemic PCs like PowerMac G5 (which doesn't support WebAssembly because it's big-endian).

Iaso says that Anubis works, and that post contains an impressive list of users, from UNESCO to the WINE, GNOME and Enlightenment projects. Others agree too. Drew DeVault, quoted above, deployed Anubis in April to protect his SourceHut code forge. The next month, describing Anubis as "the nuclear option," he switched to go-away, whose README says it "uses conventional non-nuclear options."*

There are other such measures. Nepenthes is an LLM bot tarpit: it generates endless pages of link-filled nonsense text, trapping bot-spiders. The Quixotic and Linkmaze tools work similarly, while TollBit is commercial.

Some observers have suggested using the work performed by the browser to mine cryptocurrency, but that risks being deemed malicious. Coinhive tried it nearly a decade ago, and got blocked as a result. Here, we respect Iaso's response:

It's to waste CPU cycles. I don't want to touch cryptocurrency with a 20 foot pole. I realize I'm leaving money on the table by doing this, but I don't want to alienate the kinds of communities I want to protect.

Others, such as the Reg FOSS desk's favorite internet guru Jamie Zawinski, are less impressed:

I am 100 percent allergic to cutesey kawaii bullshit intermediating me and my readers with some maybe-cryptocurrency nonsense, so fuck to all of the no of that.

His prediction is pessimistic:

Proof of work is fundamentally inflationary, wasteful bullshit that will never work because the attacker can always outspend you.

It is wasteful – that's the point – but then, so is the vast traffic generated by these bot-feeding harvesters. Some would argue that LLM bots themselves are an even vaster waste of resources and energy, and we would not disagree. As such, we're in favor of anything that hinders them. ®

Updated to add:

* In the original article, we failed to note that DeVault switched to go-away after initially deploying Anubis to protect his SourceHut code forge.
