More big city newspapers drag Microsoft, OpenAI hard in copyright lawsuit

Publishers want ChatGPT models destroyed after ML tech trained 'unlawfully' on articles

Eight big-city American newspapers have banded together to sue Microsoft and OpenAI, claiming the tech duo unlawfully used the publishers' copyrighted articles to train AI models.

Those papers are: The New York Daily News, The Chicago Tribune, The Orlando Sentinel, The Sun Sentinel of Florida, The San Jose Mercury News, The Denver Post, The Orange County Register, and The St Paul Pioneer Press.

"This lawsuit arises from the defendants purloining millions of the publishers' copyrighted articles without permission and without payment to fuel the commercialization of their generative artificial intelligence products, including ChatGPT and Copilot," the publications' lawsuit [PDF], filed at the federal level in New York City, alleged.

"Defendants have created those generative AI products in violation of the law by using important journalism created by the publishers' newspapers without any compensation."

Microsoft and OpenAI were sued last year by the New York Times on similar grounds; like the above eight titles, the NYT also wants damages, promises from the Windows giant and its sidekick not to rip off any copyrighted work, and the destruction of Microsoft Copilot and ChatGPT. Three more titles launched a copycat lawsuit against the machine learning duo in February.

These latest eight papers, all owned by media investment firm Alden Global Capital, complain that Microsoft and OpenAI are using their articles for free, even though the tech duo budgets for expensive things like the latest AI accelerator chips and the large amounts of electricity used to train the models. They also point out that Microsoft and OpenAI boast of billions of dollars' worth of value created by their AI tech.

"We take great care in our products and design process to support news organizations," OpenAI told The Register in a statement.

"While we were not previously aware of Alden Global Capital's concerns, we are actively engaged in constructive partnerships and conversations with many news organizations around the world to explore opportunities, discuss any concerns, and provide solutions. Along with our news partners, we see immense potential for AI tools like ChatGPT to deepen publishers’ relationships with readers and enhance the news experience."

Microsoft declined to comment.

In an attempt to illustrate how much OpenAI relied on the plaintiffs' articles, the suit referenced the Common Crawl's C4 dataset, which is described as a "copy of the internet" and is the most weighted dataset used to train OpenAI's GPT-3. The complaint says at least 124 million tokens were taken from articles published by the plaintiffs to teach that GPT model, though that's a fraction of the 156 billion tokens in the whole C4 dataset.
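To put that fraction in perspective, here is a minimal sketch of the arithmetic, using only the two token counts cited above. The variable and function names are our own, chosen for illustration, and the complaint's "at least" figure is treated as an exact number.

```python
# Token counts as cited from the complaint in the paragraph above.
plaintiff_tokens = 124_000_000       # "at least 124 million tokens"
c4_total_tokens = 156_000_000_000    # "156 billion tokens" in the whole C4 dataset


def share(part: int, whole: int) -> float:
    """Return part as a fraction of whole (hypothetical helper)."""
    return part / whole


fraction = share(plaintiff_tokens, c4_total_tokens)

# Express the share as a percentage and as "one token in every N".
print(f"Share of C4: {fraction:.4%}")
print(f"Roughly 1 token in every {round(1 / fraction):,}")
```

On those figures the plaintiffs' articles account for less than a tenth of one percent of C4. Because the complaint gives a lower bound, the real share could be higher.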

The suit also includes screenshots of ChatGPT and Copilot quoting the papers' articles verbatim, sometimes spitting out large chunks and weaving quotes into and out of text. Copilot was even able to regurgitate recently published articles without so much as a hyperlink back to the source websites.

Crucially, the complaint alleges that Microsoft and OpenAI specifically trained their models to remove any hints of copyright ownership, pointing out that things like author names, copyright notices, and even the word "exclusive" don't show up when Copilot reproduces an article that is otherwise identical to one published by a newspaper.

The plaintiffs also showed that OpenAI GPT-based chat bots claimed some of the plaintiffs endorsed injecting disinfectants to cure COVID-19, that the Chicago Tribune recommended buying a lounger that was recalled in 2021 due to infant fatalities, and that the Denver Post reported that smoking eliminates asthma. Obviously, the newspapers never wrote anything of the sort, and the claims were hallucinations. The titles are unhappy to be associated with such computer-generated nonsense.

The eight counts in the lawsuit thus broadly charge Microsoft and OpenAI with copyright infringement and defamation. The plaintiffs didn't ask for a specific amount of compensation, and simply said they wanted damages, attorney fees paid, and any extra relief that the court sees as appropriate.

Perhaps more consequentially, the suit demands Microsoft and OpenAI destroy their AI models and the training materials collected to make them. Were this to happen, it would undoubtedly hurt the two corps' bottom lines by billions of dollars. We imagine this catastrophic outcome will be avoided when the parties inevitably settle and the papers get their checks.

Meanwhile, Google has set up a deal with News Corp to use that mega-publisher's articles for AI training, which only costs a cool $5 to $6 million a year, according to Reuters. Adobe is also setting aside cash to obtain training materials for its video-generating AI, probably to avoid getting sued.

And OpenAI is entirely comfortable paying for training materials as it has negotiated similar arrangements with publications like the Financial Times and German publisher Axel Springer. Indeed, it's been reported that various news organizations have been in talks with OpenAI and Microsoft over licensing – even the NYT before it started suing.

To us, generally speaking, Big Tech is willing to shell out modest amounts of money to use copyrighted articles for AI training, but only after publications lawyer up or threaten to, or if the AI makers deem a publisher big enough to engage. It's all a bit tedious. ®
