Reddit hopes robots.txt tweak will do the trick in scaring off AI training data scrapers

Pay up or go away, pretty please?

For many, Reddit has become the go-to repository of community and crowdsourced knowledge, a fact that has no doubt made it a prime target for AI startups desperate for training data.

This week, Reddit announced it would be introducing measures to prevent unauthorized scraping by such organizations. These efforts will include an updated robots.txt — a file found on most websites that tells web crawlers what they can and can't index — "in the coming weeks." If you're curious, you can find Reddit's current robots.txt here.
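For illustration, robots.txt directives are plain User-agent / Disallow rules, and Python's standard `urllib.robotparser` can evaluate them. The file below is a hypothetical sketch loosely in the spirit of Reddit's stated policy (block scrapers by default, allow the Internet Archive's crawler) — it is not Reddit's actual file:

```python
from urllib import robotparser

# Hypothetical robots.txt content for illustration only —
# not what Reddit actually serves.
rules = """\
User-agent: *
Disallow: /

User-agent: ia_archiver
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# An unrecognized scraper falls under the catch-all "*" group and is barred;
# the Internet Archive's named agent gets an explicit allowance.
print(rp.can_fetch("SomeScraperBot", "https://www.reddit.com/r/all"))  # False
print(rp.can_fetch("ia_archiver", "https://www.reddit.com/r/all"))     # True
```

Note that, as the article says, this is purely advisory: `can_fetch` only reports what the file asks for, and a crawler that never consults it is unaffected.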

It should be noted that robots.txt can't force scrapers to do anything; the file's contents are more like guidelines or firm requests than rules. Web crawlers can simply ignore them, so Reddit says it will continue to rate limit and/or block rogue bots – presumably including those that disregard robots.txt – from accessing the site.
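Server-side rate limiting of that kind is commonly implemented with a token bucket: each client earns request tokens at a steady rate up to a burst cap, and requests beyond that are refused. The sketch below is a minimal illustration of the general technique, not a description of Reddit's actual infrastructure:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter — the kind of mechanism a site
    might use to throttle clients that ignore robots.txt."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A burst of 10 back-to-back requests against a bucket allowing
# 2 requests/second with a burst cap of 5: the first 5 pass, the rest are throttled.
bucket = TokenBucket(rate=2.0, capacity=5)
results = [bucket.allow() for _ in range(10)]
print(results)
```

In practice a site would keep one bucket per client IP or API key, and pair throttling with outright blocks for persistent offenders, as Reddit says it does.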

Indeed, crawlers that shun robots.txt risk being blocked outright by site administrators, where that's possible.

These measures, vague as they are at the moment, appear to be targeted specifically at those accessing Reddit for commercial gain. The site says that "Good faith actors — like researchers and organizations such as the Internet Archive — will continue to have access to Reddit content for non-commercial use."

The announcement comes just weeks after Reddit unveiled a fresh public content policy, which it spun as a way to more transparently communicate how user data is used and protect user privacy.

"We see more and more commercial entities using unauthorized access or misusing authorized access to collect public data in bulk, including Reddit public content," the site said.

It seems Reddit execs would much rather interested parties pay it for curated access to its crowdsourced hive mind of knowledge, opinion, trolling, and karma farming, as the announcement ends with a sales pitch for its data access plans.

As we've previously discussed, training large language models like GPT-4, Gemini, or Claude requires a prodigious amount of data. Meta's relatively small Llama 3 8B model was trained on some 15 trillion tokens.

Because of this, supplying the training data used to build these models has become a lucrative business. Last month Scale AI — which sells AI data services including pre-labeled datasets — saw its valuation soar to nearly $14 billion amid a $1 billion funding round led by Nvidia, Amazon, and Meta.

Meanwhile, this week also saw the formation of an AI data trade group called the Dataset Providers Alliance. The group's members include Rightsify, vAIsual, Pixta AI, Datarade, Global Copyright Exchange, Calliope Networks, and Ado.

Naturally, Reddit is keen to cash in on this demand, having already announced an agreement to sell API access to Google in a deal reportedly worth $60 million a year. The Front Page of the Internet last month reached a similar agreement with OpenAI, though the terms of the deal weren't disclosed.

How useful Reddit's data actually is has, however, been called into question in recent weeks after Google started citing obvious troll posts in its AI-generated answers. In one case the search engine suggested adding "non-toxic glue" to pizza sauce to keep the cheese from sticking.

The Register reached out to Reddit for comment on its efforts to block rogue web scrapers and on its future plans. ®
