OpenAI's ChatGPT may face a copyright quagmire after 'memorizing' these books

This top-drawer AI tech has a major science-fiction habit


Boffins at the University of California, Berkeley, have delved into the undisclosed depths of OpenAI's ChatGPT and the GPT-4 large language model at its heart, and found they're trained on text from copyrighted books.

Academics Kent Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman describe their work in a paper titled, "Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4."

"We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web," the researchers explain in their paper.

The team published its code and data on GitHub; the list of books identified can be found in this Google Docs file.

GPT-4 was found to have memorized titles such as the Harry Potter children's books, Orwell's Nineteen Eighty-Four, The Lord of the Rings trilogy, the Hunger Games books, Hitchhiker’s Guide to the Galaxy, Fahrenheit 451, A Game of Thrones, and Dune, among others.

The authors note that science fiction and fantasy books dominate the list, which they attribute to the popularity of those titles on the web. And they point out that memorizing specific titles has downstream effects. For example, these models make more accurate predictions in answer to prompts such as, "What year was this passage published?" when they've memorized the book.

Another consequence of the model's familiarity with science fiction and fantasy is that ChatGPT exhibits less knowledge of works in other genres. As the paper observes, it knows "little about works of Global Anglophone texts, works in the Black Book Interactive Project and Black Caucus American Library Association award winners."

Via Twitter, David Bamman, one of the co-authors and an associate professor in the School of Information at UC Berkeley, summarized the paper thus: "Takeaways: open models are good; popular texts are probably not good barometers of model performance; with the bias toward sci-fi/fantasy, we should be thinking about whose narrative experiences are encoded in these models, and how that influences other behaviors."

The researchers are not claiming that ChatGPT or the models upon which it is built contain the full text of the cited books – LLMs don't store text verbatim. Rather, they conducted a test called a "name cloze," in which the model is asked to predict a single masked name in a passage of 40–60 tokens (one token is equivalent to about four text characters) that contains no other named entities. The idea is that passing the test indicates the model has memorized the associated text.
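The gist of a name-cloze test can be sketched in a few lines of Python. This is a simplified illustration, not the paper's actual harness: the function names, prompt format, and `[MASK]` token are assumptions, and `predict` stands in for whatever model call the evaluator uses.

```python
def make_name_cloze(passage: str, name: str) -> str:
    """Replace the single named entity in a passage with [MASK],
    producing a name-cloze prompt (hypothetical reconstruction of
    the setup; the paper's exact prompt wording may differ)."""
    assert passage.count(name) == 1, "passage must contain the name exactly once"
    return passage.replace(name, "[MASK]")

def name_cloze_accuracy(examples, predict) -> float:
    """Fraction of masked names a model recovers exactly.
    `examples` is a list of (passage, name) pairs; `predict` is any
    callable mapping a cloze prompt to a guessed name."""
    correct = sum(
        predict(make_name_cloze(passage, name)).strip() == name
        for passage, name in examples
    )
    return correct / len(examples)
```

A model that reliably fills in the right name for passages from a given book has, by this measure, memorized something about that book; the paper ties higher name-cloze accuracy to how often passages from the book appear on the web.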

"The data behind ChatGPT and GPT-4 is fundamentally unknowable outside of OpenAI," the authors explain in their paper. "At no point do we access, or attempt to access, the true training data behind these models, or any underlying components of the systems. Our work carries out probabilistic inference to measure the familiarity of these models with a set of books, but the question of whether they truly exist within the training data of these models is not answerable."

To make such questions answerable, the authors advocate the use of public training data, so that model behavior is more transparent. They undertook the project to understand what these models have memorized, because the models behave differently when analyzing literary texts they were trained on.


"Data curation is still very immature in machine learning," Margaret Mitchell, an AI researcher and chief ethics scientist for Hugging Face, told The Register.

"'Don't test on your training data' is a common adage in machine learning, but requires careful documentation of the data; yet robust documentation of data is not part of machine learning culture. I hope this work will help further advance the state of the art in responsible data curation."

The Berkeley computer scientists focused less on the copyright implications of memorizing texts, and more on the black box nature of these models – OpenAI does not disclose the data used to train them – and how that affects the validity of text analysis.

But the copyright implications may not be avoidable – particularly if text-generating applications built on these models produce passages that are substantially similar or identical to copyrighted texts they've ingested.

Land of the free, home of the lawsuit

Tyler Ochoa, a law professor at Santa Clara University in California, told The Register he fully expects to see lawsuits against the makers of large language models that generate text, including OpenAI, Google, and others.

Ochoa said the copyright issues with AI text generation are exactly the same as the issues with AI image generation. First: is copying large amounts of text or images for training the model fair use? The answer to that, he said, is probably yes.

Second: if the model generates output that's too similar to the input – what the paper refers to as "memorization" – is that copyright infringement? The answer to that, he said, is almost certainly yes.

And third: if the output of an AI text generator is not a copy of an existing text, is it protected by copyright?


Under current law, said Ochoa, the answer is no – because US copyright law requires human creativity, though some countries will disagree and will protect AI-generated works. However, he added, activities like selecting, arranging, and modifying AI model output make copyright protection more plausible.

"So far we've seen lawsuits over issues one and three," said Ochoa. "Issue one lawsuits so far have involved AI image-generating models, but lawsuits against AI text-generating models are inevitable.

"We have not yet seen any lawsuits involving issue two. The paper [from the UC Berkeley researchers] demonstrates that such similarity is possible; and in my opinion, when that occurs, there will be lawsuits, and it will almost certainly constitute copyright infringement."

Ochoa added, "Whether the owner of the model is liable, or the person using the model is liable, or both, depends on the extent to which the user has to prompt or encourage the model to accomplish the result."

OpenAI did not respond to a request for comment. It doesn't even have a chat bot for that? ®
