Microsoft CEO of AI: Your online content is 'freeware' fodder for training models
Unless you've got a lawyer, that is
Mustafa Suleyman, the CEO of Microsoft AI, said this week that machine-learning companies can scrape most content published online and use it to train neural networks because it's essentially "freeware."
Shortly afterwards the Center for Investigative Reporting sued OpenAI and its largest investor Microsoft "for using the nonprofit news organization’s content without permission or offering compensation."
This follows in the footsteps of eight newspapers that sued OpenAI and Microsoft over alleged content misappropriation in April, and of the New York Times, which filed a similar suit four months before that.
Then there are the two authors who sued OpenAI and Microsoft in January, alleging that the pair trained AI models on the authors' works without permission. And in 2022, several unidentified developers sued OpenAI and GitHub, claiming the organizations used publicly posted programming code to train generative models in violation of software licensing terms.
Asked in an interview with CNBC’s Andrew Ross Sorkin at the Aspen Ideas Festival whether AI companies have effectively stolen the world's intellectual property, Suleyman acknowledged the controversy and attempted to draw a distinction between content people put online and content backed by corporate copyright holders.
"I think that with respect to content that is already on the open web, the social contract of that content since the 1990s has been it is fair use," he opined. "Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like. That's been the understanding."
Suleyman did allow that there's another category of content, the stuff published by companies with lawyers.
"There's a separate category where a website or publisher or news organization had explicitly said, 'do not scrape or crawl me for any other reason than indexing me,' so that other people can find that content," he explained. "But that's the gray area. And I think that's going to work its way through the courts."
That's putting it mildly. While Suleyman's remarks seem certain to offend content creators, he's not entirely wrong – it's not clear where the legal lines are with regard to AI model training and model output.
Most people posting content online as individuals will have compromised their rights in some way by accepting the Terms of Service agreements offered by major social media platforms. Reddit's decision to license its users' posts to OpenAI wouldn't happen if the social media giant thought its users had a valid claim to their memes and manifestos.
The fact that OpenAI and others making AI models are striking content deals with major publishers shows that a strong brand, deep pockets, and a legal team can bring large technology operations to the negotiating table.
In other words, content created and posted online is freeware unless those who made it retain, or can attract, attorneys willing to challenge Microsoft and its ilk.
In a paper distributed via SSRN last month, Frank Pasquale, professor of law at Cornell Tech and Cornell Law School in the US, and Haochen Sun, associate professor of law at The University of Hong Kong, explore the legal uncertainty surrounding the use of copyrighted data to train AI and whether courts will find such use fair. They conclude that AI has to be dealt with at a policy level, because current laws are ill-suited to answer the questions that now need to be addressed.
"Given that there is substantial uncertainty over the legality of AI providers’ use of copyrighted works, legislators will need to articulate a bold new vision for rebalancing rights and responsibilities, just as they did in the wake of the development of the Internet (leading to the Digital Millennium Copyright Act of 1998)," they argue.
The authors suggest that the continued uncompensated harvesting of creative works threatens not just writers, composers, journalists, actors, and other creative professionals, but generative AI itself, which will end up being starved of training data. People will stop making work available online, they predict, if it just gets used to power AI models that reduce the marginal cost of content creation to zero and deprive creators of the possibility of any reward.
That's the future Suleyman anticipates. "The economics of information are about to radically change because we can reduce the cost of production of knowledge to zero marginal cost," he said.
All this freeware that you perhaps helped create can be yours for a small monthly subscription fee. ®