Scientists claim >99 percent identification rate of ChatGPT content
Boffins and machines write very differently – and it's easy to tell
Academics have apparently trained a machine learning algorithm to detect scientific papers generated by ChatGPT and claim the software has over 99 percent accuracy.
Generative AI models have dramatically improved at mimicking human writing over a short period of time, making it difficult for people to tell whether text was produced by a machine or human. Teachers and lecturers have raised concerns that students using the tools are committing plagiarism, or apparently cheating using machine-generated code.
Software designed to detect AI-generated text, however, is often unreliable. Experts have warned against using these tools to assess work.
A team of researchers led by the University of Kansas thought it would be useful to develop a way to detect AI-generated science writing – specifically written in the style of research papers typically accepted and published by academic journals.
"Right now, there are some pretty glaring problems with AI writing," said Heather Desaire, first author of a paper published in the journal Cell Reports Physical Science, and a chemistry professor at the University of Kansas, in a statement. "One of the biggest problems is that it assembles text from many sources and there isn't any kind of accuracy check – it's kind of like the game Two Truths and a Lie."
Desaire and her colleagues compiled datasets to train and test an algorithm to classify papers written by scientists and by ChatGPT. They selected 64 "perspectives" articles – a specific style of article published in science journals – representing a diverse range of topics from biology to physics, and prompted ChatGPT to generate paragraphs describing the same research to create 128 fake articles. A total of 1,276 paragraphs were produced by AI and used to train the classifier.
Next, the team compiled two more datasets, each containing 30 real perspectives articles and 60 ChatGPT-written papers, totaling 1,210 paragraphs to test the algorithm.
In initial experiments, the classifier reportedly distinguished real science writing by humans from AI-generated papers 100 percent of the time. Accuracy at the individual paragraph level, however, dropped slightly – to 92 percent, it's claimed.
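The article does not say how paragraph-level calls were combined into a verdict on a whole document. One plausible reading, assumed here purely for illustration, is a majority vote over a document's paragraphs, which would explain how document accuracy can exceed paragraph accuracy. The function name and the "ai"/"human" labels below are hypothetical, not taken from the paper.

```python
from collections import Counter


def classify_document(paragraph_labels):
    """Combine per-paragraph labels ("ai" or "human") into one document label.

    Assumption: a simple majority vote. Ties and empty input default to
    "human", an arbitrary choice made only for this sketch.
    """
    counts = Counter(paragraph_labels)
    if counts["ai"] > counts["human"]:
        return "ai"
    return "human"


# A document with one misjudged paragraph is still labelled correctly.
print(classify_document(["ai", "ai", "human", "ai", "ai"]))
```

Under a scheme like this, a few wrong paragraph calls per document would not change the document-level outcome.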
The researchers believe their classifier is effective because it homes in on a range of stylistic differences between human and AI writing. Scientists are more likely to have a richer vocabulary and to write longer paragraphs containing more diverse words than machines. They also use punctuation such as question marks, brackets and semicolons more frequently than ChatGPT, though not speech marks used for quotations.
Compared to humans, ChatGPT is also less precise, and doesn't provide specific information about figures or the names of other scientists. Real science papers also use more equivocal language – like "however", "but" and "although" – as well as "this" and "because".
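The stylistic cues described above can be turned into numbers that a classifier could consume. The following is a minimal sketch, not the researchers' feature set: the function name, the word list and the choice of features are assumptions drawn only from the examples quoted in this article.

```python
import re

# Assumed word list: only the equivocal words quoted in the article.
EQUIVOCAL_WORDS = {"however", "but", "although"}

# Assumed punctuation features; "(" stands in for brackets.
PUNCTUATION = {"question_marks": "?", "semicolons": ";", "open_brackets": "("}


def paragraph_features(paragraph: str) -> dict:
    """Compute a few simple stylistic measurements for one paragraph."""
    words = re.findall(r"[A-Za-z']+", paragraph.lower())
    features = {
        # Paragraph length in words.
        "word_count": len(words),
        # Share of distinct words, a rough proxy for vocabulary diversity.
        "unique_word_ratio": len(set(words)) / len(words) if words else 0.0,
        # How often the quoted equivocal words appear.
        "equivocal_count": sum(1 for w in words if w in EQUIVOCAL_WORDS),
        # Whether the paragraph mentions any specific figures.
        "has_digits": any(ch.isdigit() for ch in paragraph),
    }
    for name, mark in PUNCTUATION.items():
        features[name] = paragraph.count(mark)
    return features


sample = "However, the yield rose (see Fig. 2); but why?"
print(paragraph_features(sample))
```

Feature dictionaries like this one could be fed to any off-the-shelf classifier; which model the team actually used is not stated here.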
The results, however, should be taken with a grain of salt. It's not clear how robust the algorithm is against studies that have been lightly edited by humans despite being written mostly by ChatGPT, or against real papers from other scientific journals.
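Sample size is one reason for caution. The two test sets together hold 2 × (30 + 60) = 180 documents, and a perfect score on a sample that small still leaves room for error in the true accuracy. As a rough sketch, the Wilson score interval (a standard statistical approximation, not something the paper is reported to use) shows how the uncertainty shrinks as the test set grows. The function name is hypothetical.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Approximate 95 percent Wilson score interval for a proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    )
    return centre - half_width, centre + half_width


# Assumes every document was classified correctly, at two test-set sizes.
for n in (180, 1800):
    low, high = wilson_interval(n, n)
    print(f"{n} of {n} correct: true accuracy plausibly between "
          f"{low:.3f} and {high:.3f}")
```

Under this approximation, 180 correct calls out of 180 is consistent with a true accuracy as low as roughly 98 percent, and a test set ten times larger would narrow that range considerably.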
"Since the key goal of this work was a proof-of-concept study, the scope of the work was limited, and follow-up studies are needed to determine the extent of this approach's applicability," the researchers wrote in their paper. "For example, the size of the test set (180 documents, ∼1,200 paragraphs) is small, and a larger test set would more clearly define the accuracy of the method on this category of writing examples."
The Register has asked Desaire for comment. ®