
AI is going to eat itself: Experiment shows people training bots are using bots

We speak to brains behind study into murky world of model teaching

Workers hired via crowdsource services like Amazon Mechanical Turk are using large language models to complete their tasks – which could have negative knock-on effects on AI models in the future.

Data is critical to AI. Developers need clean, high-quality datasets to build machine learning systems that are accurate and reliable. Compiling valuable, top-notch data, however, can be tedious. Companies often turn to third-party platforms such as Amazon Mechanical Turk to instruct pools of cheap workers to perform repetitive tasks – such as labeling objects, describing situations, transcribing passages, and annotating text.

Their output can be cleaned up and fed into a model to train it to reproduce that work on a much larger, automated scale.

AI models are thus built on the backs of human labor: people toiling away, providing mountains of training examples for AI systems that corporations can use to make billions of dollars.

But an experiment conducted by researchers at the École polytechnique fédérale de Lausanne (EPFL) in Switzerland has concluded that these crowdsourced workers are using AI systems – such as OpenAI's chatbot ChatGPT – to perform odd jobs online.

Training a model on its own output is not recommended. If crowd workers quietly outsource their tasks to chatbots, we could see AI models being trained on data generated not by people, but by other AI models – perhaps even the same models. That could lead to disastrous output quality, more bias, and other unwanted effects.

The experiment

The academics recruited 44 Mechanical Turk serfs to summarize the abstracts of 16 medical research papers, and estimated that 33 to 46 percent of passages of text submitted by the workers were generated using large language models. Crowd workers are often paid low wages – using AI to automatically generate responses allows them to work faster and take on more jobs to increase pay.

The Swiss team trained a classifier to predict whether submissions from the Turkers were human- or AI-generated. The academics also logged their workers' keystrokes to detect whether the serfs copied and pasted text onto the platform, or typed in their entries themselves. There's always the chance that someone uses a chatbot and then manually types in the output – but that's unlikely, we suppose.
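The article doesn't reproduce the team's feature set or model, but the general approach – a supervised classifier trained on summaries known to be human-written and summaries known to be LLM-generated for the same abstracts – can be sketched in a few lines of Python. Everything below, from the TF-IDF features to the logistic regression and the placeholder data, is illustrative rather than the EPFL team's actual code:

```python
# Illustrative sketch only – not the EPFL team's actual code or feature set.
# Train a task-specific classifier on summaries known to be human-written
# and summaries known to be LLM-generated for the same abstracts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

human_summaries = ["summary typed by a crowd worker", "another human summary"]      # placeholder data
synthetic_summaries = ["summary produced by an LLM", "another chatbot summary"]     # placeholder data

texts = human_summaries + synthetic_summaries
labels = [0] * len(human_summaries) + [1] * len(synthetic_summaries)  # 1 = synthetic

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # word and bigram features, purely illustrative
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

# Probability that a newly submitted summary was machine-generated
p_synthetic = clf.predict_proba(["a worker's submitted summary"])[0][1]
```

As Ribeiro explains below, narrowing the problem to one specific task is what made detection tractable, compared with general-purpose AI-text detectors that have to work in any context.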

"We developed a very specific methodology that worked very well for detecting synthetic text in our scenario," Manoel Ribeiro, co-author of the study and a PhD student at EPFL, told The Register this week.

"While traditional methods try to detect synthetic text 'in any context', our approach is focused on detecting synthetic text in our specific scenario."

The classifier alone can't say for certain whether someone used an AI system or produced their own work. The academics therefore combined its output with the keystroke data to be more confident about whether a submission was copy-pasted from a bot or written by hand.
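Again as an illustration rather than the paper's actual decision rule, combining the two signals can be as simple as flagging a submission only when the classifier is confident and the keystroke log shows the text was pasted rather than typed:

```python
# Illustrative decision rule – the threshold and logic are assumptions, not the paper's.
def looks_llm_assisted(p_synthetic: float, was_pasted: bool, threshold: float = 0.5) -> bool:
    """Flag a submission only when both signals point the same way."""
    return was_pasted and p_synthetic >= threshold
```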

Human data is the gold standard, because it is humans that we care about

"We managed to validate our results using keystroke data we also collected from MTurk," Ribeiro told us. "For example, we found that all texts that were not copy-pasted were classified by us as 'real', which suggests that there are few false positives."

The code and data used to run the test have been published on GitHub.

There's another reason the experiment is unlikely to be a completely fair representation of how many workers really are using AI to automate crowdsource tasks. The authors note that the text summarization task is well-suited to large language models compared to other types of jobs – meaning their results may overstate how widely workers rely on tools like ChatGPT across crowdsource work in general.

Their dataset of 46 responses from 44 workers is also small. The workers were paid $1 for each text summary, which again may only encourage the use of AI.

Large language models will get worse if they are increasingly trained on synthetic content generated by AI and collected from crowdsource platforms, the researchers argued. Outfits like OpenAI keep exactly how they train their latest models a closely guarded secret, and may not rely heavily on services like Mechanical Turk, if at all. That said, plenty of other model builders do rely on human workers, who may in turn use bots to generate the training data – and that is a problem.

Mechanical Turk, for one, is marketed as a provider of "data labeling solutions to power machine learning models."

"Human data is the gold standard, because it is humans that we care about, not large language models," Riberio said. "I wouldn't take a medicine that was only tested in a Drosophila biological model," he said as an example.

Responses generated by today's AI models are usually quite bland or trivial, and do not capture the complexity and diversity of human creativity, the researchers argued.

"Sometimes what we want to study with crowdsourced data is precisely the ways in which humans are imperfect," Robert West, co-author of the paper and an assistant professor in the EPFL's school of computer and communication science, told us.

As AI continues to improve, it's likely that crowdsourced work will change. Ribeiro speculated that large language models could replace some workers at specific tasks. "However, paradoxically, human data may be more precious than ever and thus it may be that these platforms will be able to implement ways to prevent large language model usage and ensure it remains a source of human data."

Who knows – maybe humans will even end up collaborating with large language models to generate responses too, he added. ®
