AI models face collapse if they overdose on their own output

Recursive training leads to nonsense, study finds


Researchers have found that the buildup of AI-generated content on the web is set to "collapse" machine learning models unless the industry can mitigate the risks.

The University of Oxford team found that training future models on AI-generated datasets can degrade them until they produce gibberish, a phenomenon known as model collapse. In one example, a model that began with text about medieval European architecture was – by the ninth generation – spouting nonsense about jackrabbits.

In a paper published in Nature yesterday, a team led by Ilia Shumailov, a Google DeepMind and Oxford post-doctoral researcher, found that an AI may fail to pick up the less common lines of text in its training datasets, which means subsequent models trained on its output cannot carry those nuances forward. Training new models on the output of earlier models in this way creates a recursive loop in which each generation loses more of the original data's diversity.

"Long-term poisoning attacks on language models are not new," the paper says. "For example, we saw the creation of click, content and troll farms, a form of human 'language models' whose job is to misguide social networks and search algorithms. The negative effect that these poisoning attacks had on search results led to changes in search algorithms. For example, Google downgraded farmed articles, putting more emphasis on content produced by trustworthy sources, such as education domains, whereas DuckDuckGo removed them altogether. What is different with the arrival of LLMs is the scale at which such poisoning can happen once it is automated."

In an accompanying article, Emily Wenger, assistant professor of electrical and computer engineering at Duke University, illustrated model collapse with the example of a system tasked with generating images of dogs.

"The AI model will gravitate towards recreating the breeds of dog most common in its training data, so might over-represent the Golden Retriever compared with the Petit Basset Griffon Vendéen, given the relative prevalence of the two breeds," she said.

"If subsequent models are trained on an AI-generated data set that over-represents Golden Retrievers, the problem is compounded. With enough cycles of over-represented Golden Retriever, the model will forget that obscure dog breeds such as Petit Basset Griffon Vendéen exist and generate pictures of just Golden Retrievers. Eventually, the model will collapse, rendering it unable to generate meaningful content."

While she concedes that an over-representation of Golden Retrievers may be no bad thing, the same process of collapse is a serious problem for producing meaningful, representative output that includes less common ideas and ways of writing. "This is the problem at the heart of model collapse," she said.
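Wenger's dog-breed example boils down to a simple statistical process, and a toy simulation makes the dynamic concrete. The Python sketch below illustrates that sampling effect only and is not the researchers' methodology – the function names and the crude 90/10 breed split are illustrative assumptions. Each generation "trains" on data drawn from the previous generation's model, and once the rare breed's count hits zero, no later generation can ever bring it back.

```python
# Toy illustration of model collapse via generational resampling.
# Each generation fits breed frequencies to data sampled from the
# previous generation's model; sampling noise compounds, and a rare
# breed that drops to zero can never reappear.
import random
from collections import Counter

random.seed(0)
SAMPLE_SIZE = 30     # how much "training data" each generation sees
GENERATIONS = 50

# Generation 0 trains on real data: 90% Golden Retrievers, 10% PBGVs.
data = ["golden_retriever"] * 27 + ["pbgv"] * 3

def fit(samples):
    """'Train' a model: estimate breed frequencies from the data."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {breed: n / total for breed, n in counts.items()}

def generate(model, n):
    """'Generate': draw n breeds according to the learned frequencies."""
    breeds = list(model)
    weights = [model[b] for b in breeds]
    return random.choices(breeds, weights=weights, k=n)

for gen in range(1, GENERATIONS + 1):
    model = fit(data)
    data = generate(model, SAMPLE_SIZE)   # next generation's training set
    share = data.count("pbgv") / SAMPLE_SIZE
    print(f"generation {gen:2d}: PBGV share = {share:.0%}")
```

A real training pipeline stacks the model's own approximation errors on top of this sampling noise, but the absorbing behaviour is the same: rare modes that drop out of the data stay gone.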

One existing approach to mitigate the problem is to watermark AI-generated content. However, these watermarks can be easily removed from AI-generated images. Sharing watermark information also requires considerable coordination between AI companies, "which might not be practical or commercially viable," Wenger said.

Shumailov and colleagues say that training a model with AI-generated data is not impossible, but the industry needs to establish an effective means of filtering data.

"The need to distinguish data generated by LLMs from other data raises questions about the provenance of content that is crawled from the internet: it is unclear how content generated by LLMs can be tracked at scale," the paper says.

"One option is community-wide coordination to ensure that different parties involved in LLM creation and deployment share the information needed to resolve questions of provenance. Otherwise, it may become increasingly difficult to train newer versions of LLMs without access to data that were crawled from the internet before the mass adoption of the technology or direct access to data generated by humans at scale."

Far be it from The Register to enjoy the vantage point of hindsight, but maybe somebody should have thought about this before the industry – and its investors – bet the farm on LLMs. ®
