Pulitzer Prize-winning author Michael Chabon and others sue OpenAI

Another would-be copyright class action

Pulitzer Prize-winning US novelist Michael Chabon and several other writers are the latest to file a proposed class action accusing OpenAI of copyright infringement, alleging it pulled their work into the datasets used to train the models behind ChatGPT.

The suit claims that OpenAI "cast a wide net across the internet" to capture the most comprehensive set of content available to better train its GPT models, allegedly "necessarily" leading it "to capture, download, and copy copyrighted written works, plays and articles."

One of the more interesting parts of the lawsuit is an allegation about how the authors believe the AI business got its hands on "two internet-based book corpora," which, the filing notes, OpenAI simply refers to as "Books1" and "Books2." The filing alleges that in the July 2020 paper introducing GPT-3, "Language Models are Few-Shot Learners," OpenAI disclosed that in addition to "Common Crawl" and "WebText" web page datasets, "16 percent of the GPT3 training dataset came from... 'Books1' and 'Books2'."

The writers' lawsuit goes on to allege that there are only a few places on the public internet that contain this much material, claiming that OpenAI's Books1 dataset "is based on either the Standardized Project Gutenberg Corpus or Project Gutenberg itself" and accusing the AI biz of sourcing Books2 from:

infamous "shadow library" websites, like Library Genesis ("LibGen"), Z-Library, Sci-Hub, and Bibliotik, which host massive collections of pirated books, research papers, and other text-based materials. The materials aggregated by these websites have also been available in bulk through torrent systems.

Also included in the suit are Tony and Grammy award winner David Henry Hwang, the playwright and screenwriter behind M. Butterfly, Chinglish, Yellow Face, and The Dance and the Railroad; Peabody winner and Love and Other Impossible Pursuits author Ayelet Waldman; Women We Buried, Women We Burned author Rachel Louise Snyder; and Who Is Rich? scribe Matthew Klam.

The writers contend that "when ChatGPT is prompted, it generates not only summaries, but in-depth analyses of the themes present in Plaintiffs' copyrighted works," and conclude from this that "the underlying GPT model was trained using [the] plaintiffs' works."

The writers' lawyers also claim that when asked to write a paragraph in the style of The Amazing Adventures of Kavalier & Clay, the book that bagged US novelist Chabon his Pulitzer, ChatGPT generated a passage imitating his writing style and including references to the characters dealing with "the weight of the world at war."

Screenshot from the complaint, exhibit A

The suit was filed in California federal court late last week and was yesterday assigned to San Francisco Magistrate Judge Peter H. Kang.

OpenAI is facing multiple lawsuits around copyright – including two in San Francisco filed by novelists Paul Tremblay and Mona Awad, and, separately, comedian Sarah Silverman and novelists Christopher Golden and Richard Kadrey. Its lawyers argued in those cases that the AI biz has not violated copyright laws, claiming ChatGPT's LLMs are protected under the US doctrine of "fair use." Their argument is that the way the business uses the text conforms to US copyright law, which allows a fair use exception for so-called "transformative uses" of work – a remix of the original that serves a different purpose or audience.

The US Copyright Office is currently seeking comment on a study of the copyright law and policy issues raised by artificial intelligence systems.

OpenAI's lawyers haven't yet filed a response to the Chabon complaint. We have asked OpenAI for comment.

The allegations in the case include direct and vicarious copyright infringement, illegal removal of copyright management information, unfair competition, and unjust enrichment. The plaintiffs are seeking an injunction against the infringement of their copyrights as well as unspecified damages.

OpenAI boss Sam Altman last week scored Indonesia's first-ever golden visa – meaning he can now live in the archipelagic nation for up to 10 years – in recognition of his potential to "generate inbound investment."
