AI models face collapse if they overdose on their own output
Recursive training leads to nonsense, study finds
Researchers have found that the buildup of AI-generated content on the web is set to "collapse" machine learning models unless the industry can mitigate the risks.
The University of Oxford team found that training future models on AI-generated datasets can cause them to degenerate into gibberish, a phenomenon known as model collapse. In one example, a model that started with a text about European architecture in the Middle Ages ended up – by the ninth generation – spouting nonsense about jackrabbits.
In a paper published in Nature yesterday, a team led by Ilia Shumailov, a post-doctoral researcher at Google DeepMind and Oxford, found that a model may fail to pick up less common lines of text in its training data, which means subsequent models trained on its output cannot carry those nuances forward. Training new models on the output of earlier models in this way becomes a recursive loop in which each generation loses a little more of the original distribution.
"Long-term poisoning attacks on language models are not new," the paper says. "For example, we saw the creation of click, content and troll farms, a form of human 'language models' whose job is to misguide social networks and search algorithms. The negative effect that these poisoning attacks had on search results led to changes in search algorithms. For example, Google downgraded farmed articles, putting more emphasis on content produced by trustworthy sources, such as education domains, whereas DuckDuckGo removed them altogether. What is different with the arrival of LLMs is the scale at which such poisoning can happen once it is automated."
In an accompanying article, Emily Wenger, assistant professor of electrical and computer engineering at Duke University, illustrated model collapse with the example of a system tasked with generating images of dogs.
"The AI model will gravitate towards recreating the breeds of dog most common in its training data, so might over-represent the Golden Retriever compared with the Petit Basset Griffon Vendéen, given the relative prevalence of the two breeds," she said.
"If subsequent models are trained on an AI-generated data set that over-represents Golden Retrievers, the problem is compounded. With enough cycles of over-represented Golden Retriever, the model will forget that obscure dog breeds such as Petit Basset Griffon Vendéen exist and generate pictures of just Golden Retrievers. Eventually, the model will collapse, rendering it unable to generate meaningful content."
While she concedes that an over-representation of Golden Retrievers may be no bad thing, the collapse process is a serious problem for output that meaningfully represents less common ideas and ways of writing. "This is the problem at the heart of model collapse," she said.
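Wenger's illustration is easy to reduce to a toy simulation. The sketch below is our own minimal Python illustration, not code from the paper: each generation re-estimates breed frequencies from a finite sample drawn from the previous generation's estimates, standing in for training a new model on the output of the last. The breed list, prevalence figures, and sample size are invented for the example.

```python
# Toy illustration of model collapse, not the researchers' code: each
# "generation" fits a simple frequency model to a finite sample drawn from
# the previous generation's model, standing in for training on AI-generated
# data. Breeds and prevalence figures are invented for the example.
import numpy as np

rng = np.random.default_rng(0)

breeds = ["golden_retriever", "labrador", "beagle", "petit_basset_griffon_vendeen"]
probs = np.array([0.60, 0.25, 0.13, 0.02])  # hypothetical real-world prevalence

sample_size = 50    # synthetic images "generated" per generation
generations = 20

for gen in range(1, generations + 1):
    # Generate a dataset from the current model...
    sample = rng.choice(len(breeds), size=sample_size, p=probs)
    # ...then "train" the next model by re-estimating breed frequencies
    # from that synthetic dataset alone.
    counts = np.bincount(sample, minlength=len(breeds))
    probs = counts / counts.sum()
    print(f"gen {gen}: " + ", ".join(f"{b}={p:.2f}" for b, p in zip(breeds, probs)))

# Once a rare breed draws zero samples, its estimated probability is zero in
# every later generation: nothing reintroduces it without fresh human data.
```

Run it and the two-percent breed tends to draw zero samples within a handful of generations and never returns, while the Golden Retriever's share creeps upward – the same tail-eating dynamic the paper describes for rare phrases and facts in text.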
One existing approach to mitigate the problem is to watermark AI-generated content. However, these watermarks can be easily removed from AI-generated images. Sharing watermark information also requires considerable coordination between AI companies, "which might not be practical or commercially viable," Wenger said.
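How a watermark of this kind might work for text is easiest to see in toy form. The sketch below is our own illustration of one commonly discussed family of schemes, not any company's actual system: the generator nudges its choices towards a pseudorandom "green" half of the vocabulary derived from the previous token, and a detector recomputes those green lists and measures how often they are hit. Everything in it, from the hundred-token vocabulary to the bias level, is invented for the example.

```python
# Toy sketch of a "green list" text watermark, written for illustration only:
# the generator prefers tokens from a pseudorandom half of the vocabulary
# keyed on the previous token; a detector recomputes those halves and counts
# how often the preference shows up. Vocabulary, bias, and names are invented.
import hashlib
import random

VOCAB = [f"tok{i}" for i in range(100)]

def green_list(prev_token: str) -> set:
    """Deterministically pick half the vocabulary, keyed on the previous token."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    return set(random.Random(seed).sample(VOCAB, len(VOCAB) // 2))

def generate(length: int, bias: float = 0.9, seed: int = 0) -> list:
    """Sample text that favours green-listed tokens with probability `bias`."""
    r = random.Random(seed)
    out = ["tok0"]
    for _ in range(length):
        greens = green_list(out[-1])
        pool = sorted(greens) if r.random() < bias else [t for t in VOCAB if t not in greens]
        out.append(r.choice(pool))
    return out

def green_fraction(tokens: list) -> float:
    """Detector: what share of tokens fall in their predecessor's green list?"""
    hits = sum(tokens[i + 1] in green_list(tokens[i]) for i in range(len(tokens) - 1))
    return hits / (len(tokens) - 1)

human_rng = random.Random(1)
unmarked = ["tok0"] + [human_rng.choice(VOCAB) for _ in range(200)]
watermarked = generate(200)

print("watermarked green fraction:", round(green_fraction(watermarked), 2))  # close to 0.9
print("unmarked green fraction:   ", round(green_fraction(unmarked), 2))     # close to 0.5
```

The catch Wenger points to shows up even in the toy: the detector only works because it knows exactly how the generator keyed its green lists, which is precisely the sort of information AI companies would need to coordinate on and share.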
Shumailov and colleagues say that training a model with AI-generated data is not impossible, but the industry needs to establish an effective means of filtering data.
"The need to distinguish data generated by LLMs from other data raises questions about the provenance of content that is crawled from the internet: it is unclear how content generated by LLMs can be tracked at scale," the paper says.
"One option is community-wide coordination to ensure that different parties involved in LLM creation and deployment share the information needed to resolve questions of provenance. Otherwise, it may become increasingly difficult to train newer versions of LLMs without access to data that were crawled from the internet before the mass adoption of the technology or direct access to data generated by humans at scale."
Far be it from The Register to enjoy the vantage point of hindsight, but maybe somebody should have thought about this before the industry – and its investors – bet the farm on LLMs. ®