US House mulls forcing AI makers to reveal use of copyrighted training data

Proposed law doesn't include any ban on use of such stuff to build models, mind you

A bill introduced in the US House of Representatives would require those training AI models to disclose any and all copyrighted works used, and it would apply retroactively.

Proposed yesterday by Congressman Adam Schiff (D-CA), the Generative AI Copyright Disclosure Act could prove a huge headache for AI companies using copyrighted work to train large language models and other forms of machine learning systems.

The bill would require "a person who creates a training dataset...that is used in building a generative AI system" to submit notice to the Register of Copyrights with a "sufficiently detailed summary" of any copyrighted works in the training dataset. Alterations to the dataset would also require a submission. In both instances, a URL for the training dataset would have to be provided and put on a public database.

Notice would have to be filed in a timely manner, too - the Copyright Office would have to be given a list of works within 30 days of an AI system trained on such a dataset being made public. AI systems trained on copyrighted works prior to the passage of the bill would all have 30 days to get a list in as well.

The bill includes a somewhat nebulous noncompliance penalty of at least $5,000 for failure to send a list to the Register of Copyrights.

"AI has the disruptive potential of changing our economy, our political system, and our day-to-day lives," Schiff said in a canned statement. "We must balance the immense potential of AI with the crucial need for ethical guidelines and protections."

Schiff, who is running for a Senate seat in California this year, said the bill "is about respecting creativity in the age of AI and marrying technological progress with fairness." 

A number of creative trade groups have endorsed the legislation, including the Recording Industry Association of America, the Screen Actors Guild, and both the East and West divisions of the Writers Guild of America.

"This bill is an important first step in addressing the unprecedented and unauthorized use of copyrighted materials to train generative AI systems," said WGA-West president Meredith Stiehm. "Greater transparency and guardrails around AI are necessary to protect writers and other creators."

AIs trained on popular writers, artists and musicians can regurgitate partial imitations of their works - a fact that recently drew the ire of hundreds of musicians.

A group called The Artists Rights Alliance launched a petition earlier this month to end the use of copyrighted music to train AIs, calling it "a race to the bottom that will degrade the value of our work and prevent us from being fairly compensated," as well as an assault on creativity. 

Other creative types, writers and artists among them, have railed against the use of their works to train AIs and filed lawsuits, albeit unsuccessfully, to stop AI being trained on their content. 

It's not immediately clear how AI firms will react to the bill - we've asked and will update this story if we hear back - but we note OpenAI has said it's currently impossible to train a good AI model without relying on copyrighted content.

Those relying on copyrighted materials may end up unhappy that they have to disclose what they've trained their models on. But the bill does nothing to prohibit the use of copyrighted works to train AI - the legislation just requires it be in the public record. ®
