Cloudflare tightens screws on site-gobbling AI bots
When robots.txt just ain't cutting the mustard
Cloudflare on Monday expanded its defense against the dark arts of AI web scrapers by providing customers with a bit more visibility into, and control over, unwelcome content raids.
The network biz earlier this year deployed a one-click AI bot defense to improve upon the largely ineffective robots.txt mechanism, a voluntary protocol that lets websites ask, but not require, bots to behave.
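To illustrate why robots.txt amounts to a request rather than a rule: it is only consulted by crawlers that choose to honor it. The sketch below, using Python's standard-library `urllib.robotparser` and a hypothetical crawler name (`ExampleAIBot` is made up for illustration), shows what a well-behaved bot does — and nothing stops a badly behaved one from skipping this check entirely.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt asking one AI crawler to stay out entirely
# while allowing everyone else. Nothing here is enforced by the server.
robots_txt = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

# A polite crawler checks before fetching; compliance is voluntary.
print(rp.can_fetch("ExampleAIBot", "https://example.com/articles/"))  # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/articles/"))  # True
```

Cloudflare's one-click defense works at the network layer instead, so the block holds whether or not the crawler ever reads the file.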
Cloudflare is now upgrading its arsenal with an AI Audit control panel.
The idea is to provide customers with analytics data about crawlers that harvest data for AI training and inference, so they can make better-informed decisions about whether to embrace the bots or turn them away.
"Some customers have already made decisions to negotiate deals directly with AI companies," explained Sam Rhea, a member of Cloudflare's emerging technology and incubation team. "Many of those contracts include terms about the frequency of scanning and the type of content that can be accessed. We want those publishers to have the tools to measure the implementation of these deals."
Rhea says the problem is that the emergence of AI bots has made it more complicated to determine whether programmatic access to a website is beneficial or abusive. While they're not conducting a denial-of-service attack, bots that capture site data to train AI models or serve AI search results can still present a business threat.
"AI Data Scraper bots scan the content on your site to train new LLMs," said Rhea. "Your material is then put into a kind of blender, mixed up with other content, and used to answer questions from users without attribution or the need for users to visit your site."
As software developer Simon Willison has described it, AI training is akin to "money laundering for copyrighted data." Because companies like OpenAI and Anthropic do not disclose the data used to train their models, the models effectively launder the content they ingest, much as a crypto mixer disguises the provenance of cryptocurrency.
Then, there are AI Search Crawler bots that scan content and cite it back in response to search queries. "The downside is that those users might just stay inside of that interface, rather than visit your site, because an answer is assembled on the page in front of them," said Rhea.
That is to say, AI search may not drive traffic to source websites, depriving them of ad revenue. The issue came up over the summer when iFixit CEO Kyle Wiens objected to data harvesting by Anthropic's crawlers, a situation the AI firm has since addressed.
Rhea argues that allowing AI bots to run rampant threatens the open internet.
"Without the ability to control scanning and realize value, site owners will be discouraged to launch or maintain Internet properties," he said. "Creators will stash more of their content behind paywalls and the largest publishers will strike direct deals. AI model providers will in turn struggle to find and access the long tail of high-quality content on smaller sites."
Enter Cloudflare's AI Audit control panel. The network biz believes companies can use the provided bot analytics to monitor content access deals with AI firms, which it claims are becoming more common, and enforce policies rather than trusting crawlers to obey robots.txt directives. ®