LLMs can hoover up data from books, judge rules

Anthropic scores a qualified victory in fair use case, but gets slapped for using over 7 million pirated copies

One of the most tech-savvy judges in the US has ruled that Anthropic is within its rights to scan purchased books to train its Claude AI model, but that pirating content is legally out of bounds.

In training its model, Anthropic bought millions of books, many second-hand, then cut them up and digitized the content. It also downloaded over 7 million pirated books from the Books3 dataset, Library Genesis (Libgen), and the Pirate Library Mirror (PiLiMi), and that was the sticking point for Judge William Alsup of California's Northern District court.

On Monday, he ruled that simply digitizing a print copy counted as fair use under current US law, as there was no duplication of the copyrighted work since the printed pages were destroyed after they were scanned. However, Anthropic may have to face trial over the use of pirated material.

"As Anthropic trained successive LLMs, it became convinced that using books was the most cost-effective means to achieve a world-class LLM," Alsup wrote [PDF] in Monday's ruling. "During this time, however, Anthropic became 'not so gung ho about' training on pirated books 'for legal reasons.' It kept them anyway."

The case was filed by three authors - Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson - who claimed that Anthropic illegally used their fiction and non-fiction works to train Claude. At least two of each author's books were included in the pirated material used by Anthropic.

Alsup noted that Anthropic hired the former head of partnerships at Google’s book-scanning project, Tom Turvey, who began conversations with publishers about licensing content, as other AI developers have done. But these talks were abandoned in favor of simply buying millions of books, taking the pages out, and scanning them, which the judge ruled was fair use.

"We are pleased that the Court recognized that using 'works to train LLMs was transformative — spectacularly so,'" an Anthropic spokesperson told The Register.

"Consistent with copyright’s purpose in enabling creativity and fostering scientific progress, Anthropic's LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different."

On the matter of piracy, however, Alsup noted that in January or February 2021, Anthropic cofounder Ben Mann "downloaded Books3, an online library of 196,640 books that he knew had been assembled from unauthorized copies of copyrighted books — that is, pirated." In June, he downloaded "at least five million copies of books" from Libgen, and in July 2022, another two million copies were downloaded from PiLiMi, both of which Alsup classified as "pirate libraries."

Alsup found that the pirated works weren't necessarily used to train Claude, but that the company had retained them. That could prove legally problematic for the startup, since they were kept for "Anthropic’s pocketbook and convenience," he wrote.

"This order grants summary judgment for Anthropic that the training use was a fair use. And, it grants that the print-to-digital format change was a fair use for a different reason," he wrote.

"But it denies summary judgment for Anthropic that the pirated library copies must be treated as training copies. We will have a trial on the pirated copies used to create Anthropic’s central library and the resulting damages, actual or statutory (including for willfulness). That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft but it may affect the extent of statutory damages."

Alsup's ruling is mixed news for Anthropic, but he does know his onions. For the last quarter of a century, Alsup has presided over some of the biggest tech trials in history, and his rulings have been backed up by the Supreme Court in some cases.

Alsup, a coder for over two decades (primarily in BASIC), presided over the Oracle-Google trial over fair use of Java code in Android, which led him to dabble in that language. More recently, he sentenced former Google self-driving car engineer Anthony Levandowski to 18 months in prison for stealing proprietary info from his work at Google and bringing it to a new startup, Otto, which he later sold to Uber. President Trump pardoned Levandowski in 2021.

Bartz and Johnson had no comment at the time of going to press. Graeber declined to discuss the ruling. ®
