AI training license will allow LLM builders to pay for content they consume
UK org backing it promises 'legal certainty' for devs, money for creators... but is it too late?
Updated A UK non-profit is planning to introduce a new licensing model which will allow developers of large language models to use copyrighted training data while paying the publishers it represents.
The Copyright Licensing Agency (CLA) intends to launch a Generative AI Training Licence, which is set to be available in the third quarter of 2025.
It said the licence would pay publishers and authors – especially those unable to negotiate direct licensing deals – and give AI developers of all sizes the “legal certainty” they need to access copyrighted training data.
The CLA is a not-for-profit organization representing licensing groups in the UK. It said Publishers’ Licensing Services and the Authors’ Licensing and Collecting Society (ALCS) would be part of the launch of the Generative AI Training Licence.
Mat Pfleger, CEO of CLA, said: “Training AI models on copyrighted content requires permission and compensation. CLA’s collective licence will further demonstrate that licensing is the answer and can provide a market-based solution that is efficient and effective. Our goal is to provide a clear, legal pathway for access to quality content. One that empowers innovators to develop transformative GAI technologies whilst respecting copyright and compensating rightsholders and creators where their works are used.”
The problem the CLA is likely to face is that the tech industry is not known to wait for legal certainty before it develops products, strikes M&A deals, builds social media platforms, or launches software audits. In fact, it could be argued that legal uncertainty has created the fertile ground from which it has grown to dominate government, commerce and culture so effectively. By the time legal certainty arrives, the horse has bolted, boarded a flight to Mauritius, and is sipping a gin and tonic by the pool.
For example, $300-billion-valued OpenAI has responded to a US government consultation by saying it should have access to any data it wants to train GenAI models, and by calling for foreign countries to be stopped from trying to enforce copyright rules against it and other American AI firms.
Meanwhile, the UK government consultation on AI and copyright recently closed. It proposed copyright exemptions for text and data mining (TDM).
"Exploring a TDM exception with rights reservation mechanisms, underpinned by enhanced transparency measures, may be a viable route for facilitating the agreement of licences. This will meet the needs of both rights-holders and AI developers," it said.
This is the position favored by the Oracle-backed think-tank, the Tony Blair Institute for Global Change.
Then there is the question of copyrighted material already used for training data. Books3 is a commonly used dataset, with 196,640 books in plain text format, which the UK's Publishers Association said has allowed copyright infringement on an "absolutely massive scale".
Meanwhile, The Atlantic has alleged that Meta, along with other genAI devs, may have accessed millions of copyrighted books and research papers through the LibGen dataset. Researchers have speculated that OpenAI may have done the same, and the allegations form part of lawsuits over the alleged use of copyrighted material. UK authors were alarmed to find their copyrighted books on the database. ®
Updated to add on April 25:
A spokesperson for the CLA said the license would offer a framework for AI developers to obtain the rights they need to innovate and enhance their GAI models using "high quality, curated content for training, fine-tuning and retrieval augmented generation (RAG)".
"In its consultation, the UK government proposed a controversial new copyright exception for text and data mining which rightsholders and creators opposed. CLA's new Generative AI Training Licence demonstrates that a copyright exception is neither necessary nor desirable and that licensing is the answer and can provide a market-based solution that is fair and practical," they said.
The spokesperson said "past infringements" should not be forgotten about. "It is important that copyright is respected and rights of creators and rightsholders upheld. The CLA Generative AI Training Licence will offer a retrospective rights option," they added.