Bots are overwhelming websites with their hunger for AI data

GLAM-E Lab report warns of risk to online cultural resources


Bots harvesting content for AI companies have proliferated to the point that they're threatening digital collections of arts and culture.

Galleries, Libraries, Archives, and Museums (GLAMs) say they're being overwhelmed by AI bots – web-crawling scripts that visit websites and download data for training AI models – according to a report issued on Tuesday by the GLAM-E Lab, which studies issues affecting such institutions.

GLAM-E Lab is a joint initiative between the Centre for Science, Culture and the Law at the University of Exeter and the Engelberg Center on Innovation Law & Policy at NYU Law.

Based on an anonymized survey of 43 organizations, the report indicates that cultural institutions are alarmed by aggressive harvesting of their content that shows no regard for the burden it places on their websites.

"Bots are widespread, although not universal," the report says. "Of 43 respondents, 39 had experienced a recent increase in traffic. Twenty-seven of the 39 respondents experiencing an increase in traffic attributed it to AI training data bots, with an additional seven believing that bots could be contributing to the traffic."

The surge in bots that gather data for AI training, the report says, often went unnoticed until it became so bad that it knocked online collections offline.

"Respondents worry that swarms of AI training data bots will create an environment of unsustainably escalating costs for providing online access to collections," the report says.

The institutions commenting on these concerns have differing views about when the bot surge began. Some report noticing it as far back as 2021, while others only began noticing web scraper traffic this year.

Some of the bots identify themselves, but some don't. Either way, the respondents say that robots.txt directives – voluntary behavior guidelines that web publishers post for web crawlers – are not currently effective at controlling bot swarms.
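For illustration, here's a minimal sketch of what honoring those directives looks like from the crawler's side, using Python's standard-library urllib.robotparser. The domain, page URL, and bot name are placeholders, not anything from the report.

```python
import urllib.robotparser

# Hypothetical example: check whether a crawler identifying itself as
# "ExampleAIBot" may fetch a collection page. URLs are placeholders.
parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://collections.example.org/robots.txt")
parser.read()  # fetch and parse the site's robots.txt

page = "https://collections.example.org/items/12345"
if parser.can_fetch("ExampleAIBot", page):
    # A Crawl-delay directive, if present, suggests a polite request rate
    delay = parser.crawl_delay("ExampleAIBot") or 1.0
    print(f"Allowed to fetch {page}; pausing {delay}s between requests")
else:
    print(f"robots.txt disallows {page} for this user agent")
```

The catch, as respondents note, is that nothing in the protocol forces a scraper to run a check like this. Bots that don't identify themselves, or that simply ignore robots.txt, sail straight past it.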

Bot defenses offered by the likes of AWS and Cloudflare do appear to help, but GLAM-E Lab acknowledges that the problem is complex. Placing content behind a login may not be effective if an institution's goal is to provide public access to digital assets. And there may be a reason to want some degree of bot traffic, such as bots that index sites for search engines.
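A crude version of that triage can be done server-side by inspecting User-Agent headers, as in the sketch below. This is illustrative only: the token lists are assumptions based on crawlers that publicly identify themselves, and headers are trivially spoofed – one reason commercial defenses lean on behavioral signals rather than strings.

```python
# Illustrative User-Agent triage: welcome search indexers, flag
# AI-training crawlers, treat everything else as unknown. Token lists
# are examples of self-identifying crawlers, not a complete registry.
SEARCH_CRAWLERS = ("Googlebot", "Bingbot", "DuckDuckBot")
AI_TRAINING_CRAWLERS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

def classify_request(user_agent: str) -> str:
    ua = user_agent or ""
    if any(token in ua for token in SEARCH_CRAWLERS):
        return "search"       # worth keeping: drives visitors to the collection
    if any(token in ua for token in AI_TRAINING_CRAWLERS):
        return "ai-training"  # candidate for throttling or blocking
    return "unknown"          # humans, unlabeled bots, or spoofed agents

if __name__ == "__main__":
    for ua in ("Mozilla/5.0 (compatible; GPTBot/1.2)",
               "Mozilla/5.0 (compatible; Googlebot/2.1)",
               "Mozilla/5.0 (X11; Linux x86_64)"):
        print(f"{classify_request(ua):<12} {ua}")
```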

The GLAM-E Lab survey echoes the findings of a similar report issued earlier this month by the Confederation of Open Access Repositories (COAR) based on the responses of 66 open access repositories run by libraries, universities, and other institutions.

The COAR report says: "Over 90 percent of survey respondents indicated their repository is encountering aggressive bots, usually more than once a week, and often leading to slowdowns and service outages. While there is no way to be 100 percent certain of the purpose of these bots, the assumption in the community is that they are AI bots gathering data for generative AI training."

The GLAM-E Lab survey also recalls complaints about abusive bots raised by The Wikimedia Foundation, Sourcehut, Diaspora developer Dennis Schubert, repair site iFixit, and documentation project ReadTheDocs.

Ultimately, the GLAM-E Lab report argues that AI providers need to develop more responsible ways to interact with the websites they crawl.

"The cultural institutions that host online collections are not resourced to continue adding more servers, deploying more sophisticated firewalls, and hiring more operations engineers in perpetuity," the report says. "That means it is in the long-term interest of the entities swarming them with bots to find a sustainable way to access the data they are so hungry for." ®
