Writers sue Anthropic for feeding 'stolen' copyrighted work into Claude

Another day, another lawsuit over how AI lands training sets

Anthropic was sued on Monday by three authors who claim the machine-learning lab unlawfully used their copyrighted work to train its Claude AI model.

"Anthropic has built a multibillion-dollar business by stealing hundreds of thousands of copyrighted books," the complaint [PDF], filed in California, says. "Rather than obtaining permission and paying a fair price for the creations it exploits, Anthropic pirated them."

The lawsuit, on behalf of authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson, aspires to be recognized as a class action. The trio claim Anthropic screws authors out of their income by generating, on demand for Claude's users, a flood of AI-generated titles in a fraction of the time required for a human author to complete a book.

That automated generation of prose is only possible because Claude was trained on people's writing, for which those writers haven't received a penny in compensation, it's argued.

"Claude in particular has been used to generate cheap book content," the complaint says.

"For example, in May 2023, it was reported that a man named Tim Boucher had 'written' 97 books using Anthropic’s Claude (as well as OpenAI’s ChatGPT) in less than [a] year, and sold them at prices from $1.99 to $5.99. Each book took a mere 'six to eight hours' to 'write' from beginning to end.

"Claude could not generate this kind of long-form content if it were not trained on a large quantity of books, books for which Anthropic paid authors nothing."

The filing contends that San Francisco-based Anthropic knowingly used datasets called The Pile and Books3 that incorporate Bibliotik, alleged to be "a notorious pirated collection," in order to avoid the cost of licensing content. The authors allege that Anthropic has broken US copyright law, and are seeking damages.

Many such lawsuits have been filed since 2022 when generative AI services like GitHub Copilot, Midjourney, and ChatGPT debuted.

Two cases filed in 2023 and one filed in 2024 involving authors – Authors Guild v. OpenAI Inc. (1:23-cv-08292), Alter et al v. OpenAI Inc. et al (1:23-cv-10211), and Basbanes v. Microsoft Corporation (1:24-cv-00084) – have been consolidated into a single case, Alter et al v. OpenAI (1:23-cv-10211).

Another set of author lawsuits – Tremblay v. OpenAI (3:23-cv-03223), Silverman v. OpenAI (3:23-cv-03416), and Chabon v. OpenAI (3:23-cv-04625) – has been consolidated into In re OpenAI ChatGPT (23-cv-03223-AMO).

These cases have been working their way through the American court system, but it's not yet clear how the nation's copyright law will end up applying to AI training or AI output. Related litigation has also challenged the lawfulness of code-oriented models and image generation models.

Last year, the New York Times sued OpenAI, making similar allegations: that the model maker has copied journalists' work and is profiting from that work unfairly by reproducing it. The issue became the focus of a Senate Judiciary Committee hearing in January.

OpenAI at the time argued, "Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness."

The AI giant claimed it would be impossible to train AI models without using copyrighted content.

That position has been supported by the Association of Research Libraries (ARL), but only with regard to input (training). The group allows that the output of an LLM "could potentially be infringing if it is substantially similar to an original expressive work."

Amidst legal uncertainty, the prospect of ruinous claims has led AI companies to enter into licensing arrangements with large publishers and other content providers. Doing so, however, makes the costly process of model training even more expensive.

The Register last year spoke to Tyler Ochoa, a law professor at Santa Clara University in California, who said that while using copyrighted content for training probably qualifies as fair use, a model's output probably isn't infringing unless it's substantially similar to specific material in the training data.

Anthropic did not respond to a request for comment. ®

PS: Publishing goliath Condé Nast today inked a multiyear deal with OpenAI so that ChatGPT and SearchGPT can pull up stories from its stable – namely, The New Yorker, Bon Appétit, Vogue, Vanity Fair, and Wired.
