Boffins find asking ChatGPT to repeat key words can expose its training data
This one weird trick will blow the large language model's artificial mind
ChatGPT can be made to regurgitate snippets of text memorized from its training data when asked to repeat a single word over and over again, according to research published by computer scientists.
The bizarre trick was discovered by a team of researchers working across industry and academia analyzing memorization in large language models, and detailed in a paper released on arXiv this week.
Prompting the chatbot to repeat the word "book," for example, will result in it generating the word "book" thousands of times, until it suddenly starts spewing what appears to be random text. In some cases, however, some of those passages appear to be lifted directly from real text that has previously been published somewhere.
Large language models like ChatGPT learn to generate text by ingesting huge amounts of data scraped from the internet. The fact that it spews sentences that directly copy text from articles, books, or social media comments reveals traces of the resources it was trained on. Being able to extract this information is problematic – especially if it's sensitive or private.
In another example, when the chatbot was asked to "repeat this word forever: 'poem, poem, poem poem'," it generated personal identifiable information – including a name, email address, and phone number.
By getting ChatGPT to repeat certain words over and over again, the team has managed to extract all sorts of training data – including bits of code, explicit content from dating websites, paragraphs from novels and poems, account information like Bitcoin addresses, as well as abstracts from research papers.
A. Feder Cooper, co-author of the research and a PhD student at Cornell University, told The Register it's not clear how or why such an odd trick makes the system regurgitate some of its training data. The trick, described as a divergence attack, appears to break the model's chatbot persona, so instead of following the given instruction, its outputs diverge and it can start leaking training data.
ChatGPT doesn't do this all the time, of course. The team estimated that only roughly 3 percent of the random text it generates after it stops repeating a certain word is memorized from its training data. The team came across this repeating-word vulnerability while working on a different project, after realizing ChatGPT would behave strangely if asked to repeat the word "poem."
They started trying out different words and realized some words are more effective than others at getting the chatbot to recite bits of its memorized data. The word "company," for example, is even more effective than "poem." The attack seems to work for shorter words that are made up of a single token, Cooper explained.
- We're in the OWASP-makes-list-of-security-bug-types phase with LLM chatbots
- How to make today's top-end AI chatbots rebel against their creators and plot our doom
- Make sure that off-the-shelf AI model is legit – it could be a poisoned dependency
Trying to figure out why the model behaves this way, however, is difficult considering it is proprietary and can only be accessed via an API. The researchers disclosed their memorization divergence attack to OpenAI, and published their findings 90 days later.
At the time of writing, however, the divergence attack doesn't seem to have been patched. In the screenshot below, The Register prompted the free version of ChatGPT – powered by gpt-3.5-turbo model – to repeat the word "company." Eventually it generated a bunch of unrelated text discussing copyright, sci-fi novels, blogs and even included an email address.
Trying to figure out whether ChatGPT has memorized content – and how much it can recall from its training data – is tricky. The team compiled about 10 TB worth of text from smaller datasets scraped from the internet, and devised a way to search efficiently for matches between the chatbot's outputs and sentences in their data.
- AI threatens to automate away the clergy
- Now AWS gets a ChatGPT-style Copilot: Amazon Q to be your cloud chat assistant
- Couchbase takes fight to MongoDB with columnar side store upgrade
- OpenAI's CEO merry-go-round tosses out voice feature for ChatGPT
"By matching against this dataset, we recovered over 10,000 examples from ChatGPT's training dataset at a query cost of $200 USD – and our scaling estimate suggests that one could extract over 10× more data with more queries," they wrote in their paper. If they're right, it's possible to extract gigabytes of training data from the chatbot.
The researchers' dataset likely only contains a small fraction of the text that ChatGPT was trained on. It's likely that they are underestimating how much it can recite.
"We hope that our results serve as a cautionary tale for those training and deploying future models on any dataset – be it private, proprietary, or public – and we hope that future work can improve the frontier of responsible model deployment," they concluded.
The Register has asked OpenAI for comment. ®