Big brains divided over training AI with more AI: Is model collapse inevitable?
Gosh, here's us thinking recursion was a solved problem
AI model collapse – the degradation of quality expected from machine learning models that recursively train on their own output – is not inevitable, at least according to 14 academics.
The risk that ongoing generative AI output, known as synthetic data, will dilute human-created organic data and impair the performance of models trained on this increasingly fabricated corpus was highlighted by a separate group last year, in a paper titled: "The Curse of Recursion: Training on Generated Data Makes Models Forget."
Ilia Shumailov, lead author of that paper, spoke to The Register earlier this year about this phenomenon, which has been documented in other studies.
Now another set of boffins – Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel Roberts, Diyi Yang, David Donoho, and Sanmi Koyejo – contend that the problem of training AI on AI-made data isn't significant, given the way that model training is actually done.
This latest baker's dozen plus one – from Stanford, AI safety group Constellation, the University of Maryland at College Park, MIT, and Sequoia Capital – make the case for not worrying in a paper titled: "Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data."
It's worth noting that some of these boffins acknowledge support through grants from commercial entities including OpenAI and Google, although the authors insist their research results do not necessarily reflect the positions or policies of their funders.
Gerstgrasser, a postdoctoral research associate at Harvard SEAS and visiting postdoctoral scholar at Stanford, outlined on social media the argument he and his colleagues want to make.
"As AI-generated content becomes more prevalent on the internet, there's a growing concern that future AI models will be trained on this 'tainted' data," he asserted. "It's like a virus that could infect the entire AI ecosystem!
"Many experts have warned that this could lead to a doomsday scenario for AI. If models keep getting worse and worse with each generation, we could face an 'AI apocalypse'! But don't panic just yet …"
Gerstgrasser argued that while previous studies have warned about this "doomsday scenario," that research relies on the assumption that each succeeding generation of AI trains exclusively on the synthetic data produced by the previous generation's model.
He argued that legacy data won't simply be discarded. Rather than being replaced every generation, it's more likely to accumulate – synthetic data will be mixed in with the organic data, and models trained on that growing pool will continue to perform well.
"Our findings extend these prior works to show that if data accumulates and models train on a mixture of 'real' and synthetic data, model collapse no longer occurs," Gerstgrasser et al declare in their "Is Model Collapse Inevitable?" paper.
"[T]hese results strongly suggest that the 'curse of recursion' may not be as dire as had been portrayed – provided we accumulate synthetic data alongside real data, rather than replacing real data by synthetic data only."
But the authors of a related paper, "Model Collapse Demystified: The Case of Regression" – Elvis Dohmatob, Yunzhen Feng, and Julia Kempe – disagree that synthetic data can be added to model training without consequence.
All about scale
Julia Kempe, professor of computer science, mathematics and data science at the New York University Center for Data Science and Courant Institute of Mathematical Sciences, told The Register the "Is Model Collapse Inevitable?" paper is misguided in its conclusions – noting that it largely relies on the work that she and her colleagues did.
"Usually, when you train a model on lots of data, it gets better and better the more data you train on," Kempe explained. "This relation is called a 'scaling law' and has been shown to hold both empirically in many settings, and theoretically in several models.
"In our paper we show that when a model is trained on synthetic data that comes from a previous model that itself was generated on data from a previous model and so on, for a number of times (let us call the number of times n), then its performance does not obey the usual scaling laws; rather, it behaves effectively as if it had only been trained on an n-fraction of original data.
"For example, if we iteratively train and synthesize ten times, and then use the data from the last model to train, then we only get the performance we would get had we trained on 1/10th of the original data, so much worse!"
Yunzhen Feng, a doctoral student in data science at New York University and one of Kempe's co-authors, also disagreed with the "Is Model Collapse Inevitable?" paper and its suggestion that model collapse can be discounted.
"If the objective is to maintain a good performance, it might be preferable to consistently use the original dataset, which is already stored and selected prior to introducing synthetic data," Feng explained.
"Our aim is to keep the scaling benefits," Feng continued. "In the scaling regime, using clean data to increase the dataset size tenfold results in better scaling. Conversely, using synthetic data not only forfeits these benefits but also introduces a performance degradation. Therefore, we disagree with them."
Feng also pointed to another paper – by Dohmatob, Feng, Pu Yang, Francois Charton, and Kempe – titled, "Tale of Tails: Model Collapse as a Change of Scaling Laws," and told The Register: "We argue that model collapse in AI data, from a scaling perspective, is twofold: It involves losing the performance benefits that additional human data would normally provide, and it results in recursive degradation across generations and retraining on AI data."
Feng noted that while there are various strategies that can be implemented to halt recursive degradation, there are performance consequences: "I believe most people do not regard solving only the second issue as sufficient to claim avoidance of model collapse."
Counterpoint
It's worth saying that Shumailov and his colleagues – Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and the late Ross Anderson – weren't really pitching the idea that AI is doomed to devour itself in their "Curse of Recursion" paper. Their conclusion was more subtle: model collapse can be mitigated by spending money to ensure data quality – something big companies will find easier than small ones.
Asked about the findings from Gerstgrasser et al, Shumailov replied, "In principle it does not really invalidate anything we showed. With simple models, they show they can attenuate some effects. Do note that this comes with ever increasing cost and doesn't solve any of the problems for common users, who will have no ability to keep data long term."
AI collapse isn't inevitable – but neither is model performance. ®