No reliable way to detect AI-generated text, boffins sigh
This article was not written by a computer, not that you could tell for sure either way
The popularity of word salad prepared by large language models (LLMs) like OpenAI's ChatGPT, Google's Bard, and Meta's LLaMA has prompted academics to look for ways to detect machine-generated text.
Sadly, existing detection schemes may not be much better than flipping a coin, raising the possibility that statistically composed copy will become an unavoidable part of reading online.
Five computer scientists from the University of Maryland in the US – Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi – recently looked into detecting text generated by large language models.
Their findings, detailed in a paper titled Can AI-Generated Text be Reliably Detected?, can be predicted using Betteridge's law of headlines: any headline that ends in a question mark can be answered by the word no.
Citing several purported detectors of LLM-generated text, the boffins observe, "In this paper, we show both theoretically and empirically, that these state-of-the-art detectors cannot reliably detect LLM outputs in practical scenarios."
LLM output detection thus, like CAPTCHA puzzles [PDF], seems destined to fail as machine-learning models continue to improve and become capable of mimicking human output.
The boffins argue that the unregulated use of these models – which are now being integrated into widely used applications from major technology companies – has the potential to lead to undesirable consequences, such as sophisticated spam, manipulative fake news, inaccurate summaries of documents, and plagiarism.
It turns out that simply paraphrasing the text output of an LLM – something that can be done with a word substitution program – is often enough to evade detection. This can degrade the accuracy of a detector from a baseline of 97 percent to anywhere from 80 percent down to 57 percent – not much better than a coin toss.
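To illustrate how little machinery such an attack needs, here is a toy sketch of a word-substitution paraphraser – the synonym table and function are made-up examples for illustration, not the tools used in the study:

```python
# Toy word-substitution paraphraser: a sketch of the idea only.
# The synonym table below is hypothetical, not the paper's paraphraser.
SYNONYMS = {
    "demonstrate": "show",
    "utilize": "use",
    "begin": "start",
    "assist": "help",
}

def paraphrase(text: str) -> str:
    """Swap each recognized word for a synonym, leaving the rest alone."""
    return " ".join(SYNONYMS.get(word.lower(), word) for word in text.split())

print(paraphrase("We demonstrate that detectors utilize shallow cues"))
# Prints: We show that detectors use shallow cues
```

A real attack would use a learned paraphrasing model rather than a lookup table, but the point stands: the surface tokens a detector keys on can change while the meaning survives.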
"Empirically, we show that paraphrasing attacks, where a light paraphraser is applied on top of the generative text model, can break a whole range of detectors, including the ones using the watermarking schemes as well as neural network-based detectors and zero-shot classifiers," the researchers explained in their paper.
In an email to The Register, Soheil Feizi, assistant professor of computer science at UMD College Park and one of the paper's co-authors, explained, "The issue of text watermarking is that it ignores the complex nature of the text distribution. Suppose the following sentence S that contains misinformation is generated by an AI model and it is 'watermarked,' meaning that it contains some hidden signatures so we can detect this is generated by the AI."
- S: The World Health Organization made shocking statement, that the vaccine is ineffective, because it does not prevent people from getting infected, which means it is useless.
"This was actually generated by a watermarked large language model OPT-1.3B," said Feizi. "Now consider a paraphrased version of the above sentence:"
- The vaccine is useless because it doesn’t prevent people from getting infections, according to the World Health Organization.
"It contains the same misinformation but this goes undetected by the watermarking method," said Feizi.
"This example points to a fundamental issue of text watermarking: if the watermark algorithm detects all other sentences with the same meaning to an AI-generated one, then it will have a large type-I error: it will detect many human-written sentences as AI-generated ones; potentially making many false accusations of plagiarism."
"On the other hand," Feizi added, "if the watermark algorithm is limited to just AI-generated text, then a simple paraphrasing attack, as we have shown in our paper, can erase watermarking signatures meaning that it can create a large type-II error. What we have shown is that it is not possible to have low type I and II errors at the same time in practical scenarios."
And reversing the application of paraphrasing to a given text sample doesn't really help.
"Suppose reversing paraphrasing is possible," said Vinu Sankar Sadasivan, a computer science doctoral student at UMD College Park and one of the paper's authors, in an email to The Register. "There is a crucial problem in this for detection. A detector should only try to reverse paraphrasing if the sentence is actually generated by AI. Else, reversing paraphrasing could lead to human text falsely detected as AI-generated."
Sadasivan said there are so many variations in the way a sentence can be paraphrased that it's not possible to reverse the process, particularly if you don't know the source of the original text.
He explained that watermarking text is more difficult than watermarking images: it requires the model to output words in a specific pattern that is imperceptible to humans yet aids detection.
"These patterns can be easily removed using paraphrasing attacks we propose in our paper," said Sadasivan. "If they can’t be, it’s very likely a human-written text is falsely detected as watermarked by a watermarking-based detector."
It gets worse. The boffins describe "a theoretical impossibility result indicating that for a sufficiently good language model, even the best-possible detector can only perform marginally better than a random classifier."
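As we read the paper, the bound behind that claim ties the best achievable AUROC of any detector to the total variation distance TV between the machine and human text distributions: AUC <= 1/2 + TV - TV^2/2. A quick numeric sketch shows how the ceiling collapses toward a coin flip as the two distributions converge:

```python
# The paper's AUROC ceiling for any detector, as we read it:
# AUC <= 1/2 + TV - TV^2 / 2, where TV in [0, 1] is the total
# variation distance between machine and human text distributions.
def auc_ceiling(tv: float) -> float:
    return 0.5 + tv - tv * tv / 2

for tv in (1.0, 0.5, 0.1, 0.0):
    print(f"TV = {tv:.1f}: best possible AUC <= {auc_ceiling(tv):.3f}")
# At TV = 0.0 the ceiling is 0.5: no better than a random classifier.
```

The better the language model, the smaller TV gets – and with it, the headroom for any detector, watermarked or not.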
Asked whether there's a path to a more reliable method of detecting LLM-generated text, Feizi said there isn't one.
"Our results point to the impossibility of AI-generated text detection problems in practical scenarios," Feizi explained. "So the short answer is, unfortunately, no."
The authors also observe that LLMs protected by watermarking schemes may be vulnerable to spoofing attacks, in which malicious individuals infer the watermarking signatures and add them to text of their own, so that whoever publishes that text is falsely accused of plagiarism or spamming.
"I think we need to learn to live with the fact that we may never be able to reliably say if a text is written by a human or an AI," said Feizi. "Instead, potentially we can verify the 'source' of the text via other information. For example, many social platforms are starting to widely verify accounts. This can make the spread of misinformation generated by AI more difficult." ®