Open source licenses need to leave the 1980s and evolve to deal with AI
Time to get with the program... before artificial intelligence does
Opinion Free software and open source licenses evolved to deal with code in the 1970s and '80s. Today they must transform again to deal with AI models.
AI was born from open source software. But free software and open source licenses, which are grounded in copyright law and were designed for software code, are not a good fit for the large language model (LLM) neural nets and datasets that fuel open source AI. Since many programming datasets in particular are built from free software and open source code, something must be done. That's why Stefano Maffulli, Open Source Initiative (OSI) executive director, and a host of other open source and AI leaders are working on combining AI and open source licenses in ways that make sense for both.
Lest you think this is some kind of theoretical legal discussion with no impact on the real world, think again. Consider J. Doe 1 et al vs GitHub. The plaintiffs in this case, filed in the US District Court for the Northern District of California, allege that Microsoft, OpenAI, and GitHub, via their commercial AI-based systems, OpenAI's Codex and GitHub's Copilot, ripped off their open source code. Specifically, the plaintiffs claim the "suggested" code often consists of near-identical copies of code scraped from public GitHub repositories, stripped of the required open source license attributions.
This case continues. The amended complaint includes claims of Digital Millennium Copyright Act violations, breach of contract (both open source license violations and the sale of licensed materials in violation of GitHub's policies), unjust enrichment, and unfair competition.
Don't think this kind of lawsuit is just Microsoft's problem. It's not. Sean O'Brien, a Yale Law School lecturer in cybersecurity and founder of the Yale Privacy Lab, told my colleague David Gewirtz: "I believe there will soon be an entire sub-industry of trolling that mirrors patent trolls, but this time surrounding AI-generated works. A feedback loop is created as more authors use AI-powered tools to ship code under proprietary licenses. Software ecosystems will be polluted with proprietary code that will be the subject of cease-and-desist claims by enterprising firms."
He's right. I've been covering patent trolls for decades. I guarantee that licensing trolls will come after "your" ChatGPT and Copilot code.
Some people, such as Felix Reda, a German researcher and politician, claim that all AI-produced code is public domain. US attorney Richard Santalesa, a founding member of the SmartEdgeLaw Group, observed to Gewirtz that both contract law and copyright law issues are in play, and they're not the same thing. Santalesa believes companies producing AI-generated code will "as with all of their other IP, deem their provided materials – including AI-generated code – as their property." In any case, public domain code is not the same thing as open source code.
On top of all that, there's the whole issue of how the datasets themselves should be licensed. There are many "open" datasets released under various open source licenses, but those licenses are rarely a good fit.
In our conversation, the Open Source Initiative's Maffulli elaborated on how the various artifacts produced by AI and machine learning systems fall under different laws and regulations, and how the open source community must determine which laws best serve its interests. Maffulli compared the current situation to the late '70s and '80s, when software emerged as a distinct discipline and copyright began to be applied to source and binary code.
We're at a similar crossroads today. AI programs such as TensorFlow, PyTorch, and the Hugging Face Hub work well under their open source licenses. The new AI artifacts are another story. Datasets, models, weights, and the like don't fit squarely into the traditional copyright model. Maffulli argued that the tech community should devise something new that aligns better with its objectives, rather than relying on "hacks."
Specifically, Maffulli noted, open source licenses designed for software might not be the best fit for AI artifacts. For instance, while the MIT License's broad freedoms could plausibly apply to a model, questions arise with more complex licenses like the Apache License or the GPL. Maffulli also addressed the challenges of applying open source principles to sensitive fields like healthcare, where regulations around data access pose unique hurdles. The short version: medical data can't simply be open sourced.
At the same time, most commercial LLM datasets are black boxes. We literally don't know what's in them. So we end up, as the Electronic Frontier Foundation (EFF) puts it, in a "Garbage In, Gospel Out" situation. What we need, the EFF concludes, is open data.
So it is that the OSI, said Maffulli, together with Open Forum Europe, Creative Commons, the Wikimedia Foundation, Hugging Face, GitHub, the Linux Foundation, the ACLU, Mozilla, and the Internet Archive, is working on a draft defining a common understanding of open source AI principles. This will be "critical in conversations with legislative bodies." Even now, EU, US, and UK government agencies are struggling to develop AI regulation, and they're woefully under-equipped to deal with the issues.
Maffulli concluded by saying we should start with "a return to the basics": the GNU Manifesto, which predates most open source licenses and sets the "North Star" for the movement. Its principles, he suggested, remain surprisingly relevant when applied to AI systems. By focusing on first principles, we'll be better able to navigate this complex intersection of AI and open source. ®