GitHub's Copilot flies into its first open source copyright lawsuit
It won't be the last
Opinion GitHub Copilot, Microsoft's AI-driven, pair-programming service, is already wildly popular. Microsoft broke out GitHub's revenue and subscription numbers in its latest quarterly report for the first time.
GitHub now has annual recurring revenue of $1 billion, up from a reported $200 million to $300 million when it was acquired. It now boasts 90 million active users on the platform, up from last November's 73 million. Much of its recent revenue and subscriber jump can be ascribed to Copilot. Too bad the party may soon be over.
When Copilot first rolled out the door, smart people were concerned because its machine learning (ML) model, which is based on OpenAI's Codex, drew on code that was copyrighted and released under one open-source license or another. After all, Codex had been trained on billions of lines of publicly available source code – including code in public repositories on GitHub. That included, among other things, the code of the Apache Software Foundation's many projects.
So, it is no surprise that Matthew Butterick, a lawyer, designer, and developer, announced he was working with Joseph Saveri Law Firm, a major class action law firm, to investigate the possibility of filing a copyright claim against GitHub. That possibility has become a reality.
On November 3rd, they filed a class-action lawsuit against Microsoft and partners in the US District Court for the Northern District of California. Their claim? Copilot is an AI-based system trained on publicly accessible open-source licensed code. While GitHub claims that the code Copilot produces for programmers is not merely copied from that training code, the suit claims that's exactly what it is. "Defendants claim Codex and Copilot do not retain copies of the materials they are trained on. In practice, however, the Output is often a near-identical reproduction of code from the training data."
Further, the suit says, "Codex does not identify the owner of the copyright to this Output, nor any other – it has not been trained to provide Attribution. Nor does it include a Copyright Notice nor any License Terms attached to the Output. This is by design – Codex was not coded or trained to reproduce such data."
In short, they are alleging Copilot is just a copyright-breaking copycat.
Microsoft can't argue the facts. Copilot is based on open source code. The real question is whether its use of that code violated the code's copyright. Is it "fair use," or is it intellectual property theft? That, my friend, is a complicated question. It will not be solved quickly. Butterick knows this.
"This is the first step in what will be a long journey. As far as we know, this is the first class-action case in the US challenging the training and output of AI systems. It will not be the last. AI systems are not exempt from the law. Those who create and operate these systems must remain accountable."
GitHub, of course, claims, "We've been committed to innovating responsibly with Copilot from the start, and will continue to evolve the product to best serve developers across the globe."
That doesn't say much, does it?
Microsoft and OpenAI haven't commented yet on the suit. That will come in time. This case will not be going away. Eventually, they'll need to address the claims. Then, since I see no chance of this being settled out of court, the case will start its long, slow journey through the US legal system. I don't expect to see a definitive answer this decade.
In the meantime, open source leaders are still considering all the ramifications of this lawsuit. Open Source Initiative (OSI) veteran Simon Phipps mentioned on Mastodon that he thinks "the only thing that it's safe to conclude at this point about Copilot is the legal uncertainty makes it inappropriate for use in open-source projects."
The Software Freedom Conservancy (SFC) explained that while the "issue is dire and important," it's not simple. For example, one important principle of open-source license enforcement is "Community-oriented enforcement must never prioritize financial gain." By its very nature, a class-action suit is inclined to be about financial compensation.
The SFC hopes the plaintiffs will "endorse these principles. We do share your frustration and anger that Microsoft's GitHub has continued its infringement and Microsoft and GitHub's refusal to work with the community regarding their aggressive anti-FOSS activity and unprecedented license violation. However, FOSS licensing is not primarily about business models, or financial recovery. GitHub's actions with Copilot are offensive primarily because they seek to undermine the system of copyleft that is specifically designed to assure that users, developers, and consumers all have equal rights."
The dangers are also potentially great for Copilot users. If the plaintiffs win their case against GitHub, every last bit of code you've produced using Copilot may be subject to a variety of open-source licenses. If that doesn't scare you, talk to your company's lawyers. You may notice that they'll turn white once they get their minds around it.
Make no mistake about it. This lawsuit – win, lose, or draw – will change how we use open source software and AI/ML. Indeed, it's likely to change the entire technology world. Hang on, we're going to be in for a rough ride. ®