OpenAI sued, again, for scraping and replicating news stories

The Intercept, Raw Story, and AlterNet want damages and to have their content removed from models

Three digital publishers sued OpenAI in two separate lawsuits filed on Wednesday, claiming that it stole their copyrighted articles to train ChatGPT.

ChatGPT was trained on huge swathes of text scraped from the internet, including lots of journalism. News publishers, however, aren't happy that OpenAI used their articles to train its models without permission or compensation, and the New York Times has already sued OpenAI over the issue.

The Intercept, Raw Story, and AlterNet are the latest media organizations to sue OpenAI for copyright infringement. The Intercept filed one case; Raw Story and AlterNet, which are owned by the same entity, filed the other. The same law firm, Loevy & Loevy, is running both cases.

In its case, The Intercept has also gone after Microsoft, which backs OpenAI and uses the super lab's technology.

Both lawsuits accuse the defendants of copyright infringement and violating the Digital Millennium Copyright Act, which prohibits removing the names of authors and titles of their work to hide IP theft.

"When they populated their training sets with works of journalism, Defendants had a choice: they could train ChatGPT using works of journalism with the copyright management information protected by the DMCA intact, or they could strip it away," the court documents in the case initiated by Raw Story and AlterNet state [PDF].

"Defendants chose the latter, and in the process, trained ChatGPT not to acknowledge or respect copyright, not to notify ChatGPT users when the responses they received were protected by journalists' copyrights, and not to provide attribution when using the works of human journalists."

Similar DMCA violation claims, made by writers in a previous lawsuit against OpenAI, have not succeeded.

Attorneys representing The Intercept, Raw Story, and AlterNet said it's not clear which text OpenAI and Microsoft use to train their models, but pointed to three datasets - WebText, WebText2, and Common Crawl - that they believe include the plaintiffs' content. The lawyers believe that articles from all three publishers have been scraped, and argued that ChatGPT generates content that mimics "significant amounts" of copyrighted journalistic materials "at least some of the time."

"Based on the publicly available information described above, thousands of Plaintiffs' copyrighted works were included in Defendants' training sets without the author, title, and copyright information that Plaintiffs conveyed in publishing them," court documents [PDF] from The Intercept's legal team state.

The plaintiffs in both cases are seeking damages and an injunction forcing the AI chatbot developers to remove all copies of their copyrighted works. They also want judges in the US District Court for the Southern District of New York to allow a jury trial.

The Register has asked OpenAI and Microsoft for comment. ®
