OpenAI to reveal secret training data in copyright case – for lawyers' eyes only
Counsel for aggrieved authors will view the info in a secure room with no internet access and no devices present
OpenAI has agreed to reveal the data used to train its generative AI models to attorneys pursuing copyright claims against the developer on behalf of several authors.
The authors – among them Paul Tremblay, Sarah Silverman, Michael Chabon, David Henry Hwang, and Ta-Nehisi Coates – sued OpenAI and its affiliates last year, arguing its AI models have been trained on their books and reproduce their words in violation of US copyright law and California's unfair competition rules. The writers' actions have been consolidated into a single claim [PDF].
OpenAI faces similar allegations from other plaintiffs, and earlier this year, Anthropic was also sued by aggrieved authors.
On Tuesday, US magistrate judge Robert Illman issued an order [PDF] specifying the protocols and conditions under which the authors' attorneys will be granted access to OpenAI's training data.
The terms of access are strict, treating the training data set as the equivalent of sensitive source code, a proprietary business process, or a secret formula. Even so, the models behind ChatGPT (GPT-3.5, GPT-4, etc.) presumably relied heavily on publicly accessible data that's widely known, as was the case with GPT-2, for which a list of domains whose content was scraped is on GitHub (The Register is on the list).
"Training data shall be made available by OpenAI in a secure room on a secured computer without internet access or network access to other unauthorized computers or devices," the judge's order states.
No recording devices will be permitted in the secure room, and OpenAI's legal team will have the right to inspect any notes made therein.
OpenAI did not immediately respond to a request to explain why such secrecy is required. One likely reason is fear of legal liability – if the extent of permissionless use of online data were widely known, that could prompt even more lawsuits.
Forthcoming AI regulations may force developers to disclose more about what goes into their models. Europe's Artificial Intelligence Act, which takes effect in August 2025, declares, "In order to increase transparency on the data that is used in the pre-training and training of general-purpose AI models, including text and data protected by copyright law, it is adequate that providers of such models draw up and make publicly available a sufficiently detailed summary of the content used for training the general-purpose AI model."
The rules include some protections for trade secrets and confidential business information, but make clear that the information provided should be detailed enough to satisfy those with legitimate interests – "including copyright holders" – and to help them enforce their rights.
California legislators have approved an AI data transparency bill (AB 2013), which awaits governor Gavin Newsom's signature. And a federal bill, the Generative AI Copyright Disclosure Act, calls for AI models to notify the US Copyright Office of all copyrighted content used for training.
The push for training data transparency may concern OpenAI, which already faces many copyright claims. The Microsoft-affiliated developer continues to insist that its use of copyrighted content qualifies as fair use and is therefore legally defensible. Its attorneys said as much in their answer [PDF] last month to the authors' amended complaint.
"Plaintiffs allege that their books were among the human knowledge shown to OpenAI's models to teach them intelligence and language," OpenAI's attorneys argue. "If so, that would be paradigmatic transformative fair use."
More broadly, OpenAI's legal team contends that generative AI is about creating new content rather than reproducing training data. Processing copyrighted works during training allegedly doesn't infringe because the models merely extract word frequencies, syntactic patterns, and other statistical data.
"The purpose of those models is not to output material that already exists; there are much less computationally intensive ways to do that," OpenAI's attorneys claim. "Instead, their purpose is to create new material that never existed before, based on an understanding of language, reasoning, and the world."
That's a bit of misdirection. Generative AI models, though capable of unexpected output, are designed to predict a sequence of tokens or characters, drawn from patterns in their training data, that's relevant to a given prompt and any attendant system rules. Predictions insufficiently grounded in training data are called hallucinations – "creative" though they may be, they are not a desired result.
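To see what's at stake in that framing, consider a toy sketch – emphatically not OpenAI's actual method, and a deliberately crude stand-in for a neural network. This bigram model "trains" by extracting word-pair frequencies from a made-up corpus, then "generates" by predicting each next token from those statistics:

```python
from collections import Counter, defaultdict
import random

# Toy bigram "model": like a large language model in miniature, it
# learns only statistics from its training text, then predicts the
# next token from those statistics. Hypothetical corpus for
# illustration; real LLMs use neural networks, not lookup tables.
corpus = "the cat sat on the mat and the cat ran".split()

# "Training": count which token follows which, the kind of
# statistical extraction OpenAI's filing describes.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Sample the next token in proportion to training-data frequency."""
    counts = follows[token]
    return random.choices(list(counts), weights=list(counts.values()))[0]

# "Generation": start from a prompt token and extend it.
token = "the"
output = [token]
for _ in range(5):
    if not follows[token]:  # dead end: token never seen mid-corpus
        break
    token = predict_next(token)
    output.append(token)
print(" ".join(output))  # e.g. "the cat sat on the mat"
```

With a corpus this small, every "prediction" is a verbatim fragment of the training text. Scale blurs that effect but doesn't change the mechanism, which is why grounding in training data matters to the copyright question.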
No open-and-shut case
Whether AI models reproduce training data verbatim is relevant to copyright law. Their capacity to craft content that's similar but not identical to source data – "money laundering for copyrighted data," as developer Simon Willison has described it – is a bit more complicated, legally and morally.
Even so, there's considerable skepticism among legal scholars that copyright law is the appropriate regime to address what AI models do and their impact on society. To date, US courts have echoed that skepticism.
As noted by Politico, US District Court judge Vincent Chhabria last November granted Meta's motion to dismiss [PDF] all but one of the claims brought on behalf of author Richard Kadrey against the social media giant over its LLaMA model. Chhabria called the claim that LLaMA itself is an infringing derivative work "nonsensical." He dismissed the other copyright claims, the DMCA claim, and all of the state law claims.
That doesn't bode well for the authors' lawsuit against OpenAI, or other cases that have made similar allegations. No wonder there are over 600 proposed laws across the US that aim to address the issue. ®
PS: OpenAI's chief research officer Bob McGrew just bailed from the org right after CTO Mira Murati said she was leaving, too. Curiouser and curiouser. CEO Sam Altman confirmed the changes here.