Websites clamp down as creepy AI crawlers sneak around for snippets

Shrinks training pool, but hurts services like the Internet Archive

The internet is becoming significantly more hostile to webpage crawlers, especially those operated for the sake of generative AI, researchers say.

In a study titled "Consent in Crisis," the Data Provenance Initiative looked into the domains scanned for three of the most important datasets used to train AI models. Training data usually includes publicly available info from all sorts of websites, but making data publicly accessible isn't the same as consenting to its automated collection by a crawler.

Crawling for data, also known as scraping, predates generative AI by many years, and websites have long had rules about what crawlers can and can't do. Those rules live in the robots.txt standard (basically an honor code for crawlers) as well as in websites' terms and conditions.
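For the unfamiliar, robots.txt is a plain-text file served from a site's root that pairs crawler user-agent names with allow and disallow rules. A minimal sketch, using GPTBot (which comes up below) as the blocked agent, might read:

    User-agent: GPTBot
    Disallow: /

    User-agent: *
    Allow: /

The first rule asks OpenAI's GPTBot to stay away from the entire site; the wildcard rule leaves everything open to other crawlers. Nothing technically stops a crawler from ignoring the file, hence the honor code.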

The researchers examined the whole datasets – C4, Dolma, and RefinedWeb – as well as their most-used domains. The data shows that websites reacted to the arrival of AI crawlers in 2023.

Specifically, the debut of OpenAI's GPTBot and Google's Google-Extended crawlers immediately prompted websites to tighten their robots.txt restrictions. Today, between 20 and 33 percent of the top domains enforce complete restrictions on crawlers, up from just a few percent in early 2023.

Across the whole body of domains, only 1 percent enforced restrictions prior to mid-2023; now 5 to 7 percent do.

Some websites are also changing their terms of service to completely ban both crawling and using hosted content for generative AI, though the change isn't nearly as drastic as it is with robots.txt.

When it comes to whose crawlers are getting blocked, OpenAI is by far in the lead, having been banned from 25.9 percent of top sites. Anthropic and Common Crawl have been kicked out of 13.3 percent, while crawlers from Google, Meta, and others are restricted at less than 10 percent of domains.

As for which sites are putting up barriers to AI crawlers, it's largely news sites. Among all domains, news publications were by far the most likely to have terms of service (ToS) and robots.txt settings restricting AI crawlers. Among the top domains specifically, however, social media platforms and forums (think Facebook and X) were just as likely as news publications to restrict crawlers via their ToS.

New rules on crawling needed to fix this mess

Although it's clear lots of websites don't want their content being scraped for use in AI, the Data Provenance Initiative says they're not communicating that effectively.

Part of this is down to the restrictions in robots.txt and the ToS not lining up. Some 34.9 percent of the top training websites make it clear in their ToS that crawling isn't allowed, but fail to mirror that in robots.txt. On the other hand, websites with no ToS at all are surprisingly likely to set up partial or complete blocks on crawlers.

And when crawling is banned, websites tend to single out OpenAI, Common Crawl, and Anthropic. The study also found that some websites fail to correctly identify the crawlers they mean to restrict: some 4.5 percent of sites banned Anthropic-AI and Claude-Web instead of Anthropic's actual crawler, ClaudeBot.

Plus, some bots collect training material while others grab up-to-date info to answer live queries, and the distinction isn't always clear to website operators. So while GPTBot is banned on some domains, ChatGPT-User isn't, even though both are used for crawling.
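To make the mismatch concrete, here's a sketch of the kind of robots.txt the study describes, using the user-agent strings named above:

    User-agent: Anthropic-AI
    Disallow: /

    User-agent: GPTBot
    Disallow: /

Because robots.txt matches user-agent tokens literally, the first rule blocks nothing from Anthropic, whose crawler announces itself as ClaudeBot, and the second leaves OpenAI's ChatGPT-User untouched. A site intent on shutting both vendors out would also need entries for ClaudeBot and ChatGPT-User.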

Obviously, sites locking down their data will negatively impact AI model training, especially since the websites most likely to crack down tend to have the highest-quality data. But the team points out that crawlers run by academics and nonprofits like the Internet Archive are getting caught in the crossfire.

The study also raises the possibility that AI firms have wasted effort by crawling so aggressively that they're getting banned from content their users barely ask about. While almost 40 percent of the top domains used in the three datasets were news-related, over 30 percent of ChatGPT inquiries concerned creative writing, compared to about 1 percent that concerned news.

Other common requests included sexual roleplay (the second most common use), translation, coding assistance, and general information.

The researchers say the traditional structures of robots.txt and ToS aren't capable of accurately expressing rules in the age of AI. Part of the problem is that a total ban is the easiest option: robots.txt is mostly useful for blocking specific crawlers outright, not for communicating nuanced rules, like what a crawler is allowed to do with the data it collects.
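A short Python sketch shows how little a well-behaved crawler actually learns from robots.txt. The standard library's urllib.robotparser can only answer "may this user agent fetch this URL" – there's no field for usage terms (example.com below is a placeholder):

    from urllib import robotparser

    # Fetch and parse the site's robots.txt (example.com is illustrative)
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # The only question the protocol answers: fetch or don't fetch.
    # There is no way to say "index this, but don't train on it."
    for agent in ("GPTBot", "ChatGPT-User", "ClaudeBot"):
        print(agent, rp.can_fetch(agent, "https://example.com/article"))

Everything beyond that yes-or-no answer has to live in the ToS, which crawlers don't parse.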

Until better protocols emerge, however, the current trajectory of AI data scraping could reshape how the web is structured, and likely make it less open than it was before. ®
