How GitHub Copilot could steer Microsoft into a copyright storm

AI-driven coding tool might generate other people's code – who knew? Well, Redmond, for one

Special report GitHub Copilot – a programming auto-suggestion tool trained from public source code on the internet – has been caught generating what appears to be copyrighted code, prompting an attorney to look into a possible copyright infringement claim.

On Monday, Matthew Butterick, a lawyer, designer, and developer, announced he is working with Joseph Saveri Law Firm to investigate the possibility of filing a copyright claim against GitHub. There are two potential lines of attack here: is GitHub improperly training Copilot on open source code, and is the tool improperly emitting other people's copyrighted work – pulled from the training data – to suggest code snippets to users?

Butterick has been critical of Copilot since its launch. In June he published a blog post arguing that "any code generated by Copilot may contain lurking license or IP violations," and thus should be avoided.

That same month, Denver Gingerich and Bradley Kuhn of the Software Freedom Conservancy (SFC) said their organization would stop using GitHub, largely as a result of Microsoft and GitHub releasing Copilot without addressing concerns about how the machine-learning model dealt with different open source licensing requirements.

Many developers have been worried about what Copilot means for open source

Copilot's capacity to copy code verbatim, or nearly so, surfaced last week when Tim Davis, a professor of computer science and engineering at Texas A&M University, found that Copilot, when prompted, would reproduce his copyrighted sparse matrix transposition code.

Asked to comment, Davis said he would prefer to wait until he has heard back from GitHub and its parent Microsoft about his concerns.

In an email to The Register, Butterick indicated there's been a strong response to news of his investigation. 

"Clearly, many developers have been worried about what Copilot means for open source," he wrote. "We're hearing lots of stories. Our experience with Copilot has been similar to what others have found – that it's not difficult to induce Copilot to emit verbatim code from identifiable open source repositories. As we expand our investigation, we expect to see more examples.

"But keep in mind that verbatim copying is just one of many issues presented by Copilot. For instance, a software author's copyright in their code can be violated without verbatim copying. Also, most open-source code is covered by a license, which imposes additional legal requirements. Has Copilot met these requirements? We're looking at all these issues."

Spokespeople for Microsoft and GitHub were unable to comment for this article. However, GitHub's documentation for Copilot warns that the output may contain "undesirable patterns" and puts the onus on users of Copilot to avoid infringing intellectual property. That is to say, if you use Copilot to auto-complete code for you and you get sued, you were warned. That warning implies that the potential for Copilot to produce copyrighted code was not unanticipated.

'Eager'

When GitHub introduced a beta version of Copilot in 2021, and questions about copyright and licensing were raised, then-CEO Nat Friedman opined "training ML systems on public data is fair use [and] the output belongs to the operator, just like with a compiler. We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!"

That participation, incidentally, has included GitHub-funded panel discussions about the impact of AI on open source, at an event run by the Open Source Initiative, which is partly funded by Microsoft.

Kuhn from the SFC told The Register in an email that statements by GitHub's now-ex CEO that these copyright issues are settled law create a false narrative – a point he's made previously.

"We've spoken with Microsoft and GitHub multiple times on this issue and their unsupported anti-FOSS [free and open source software] position has remained disturbingly consistent," he wrote. "We believe that Microsoft and GitHub have made the political calculation that if they keep repeating that what they're doing is acceptable, early and often, that they can make true what is not known to be true."

Yet among those who find tools like Copilot useful, there's hope that assistive AI can be reconciled with our social and legal frameworks, and that a model's output won't lead to litigation.

Brett Becker, assistant professor at University College Dublin in Ireland, told The Register in an email, "AI-assisted programming tools are not going to go away and will continue to evolve. Where these tools fit into the current landscape of programming practices, law, and community norms is only just beginning to be explored and will also continue to evolve.

"An interesting question is: what will emerge as the main drivers of this evolution? Will these tools fundamentally alter future practices, law, and community norms – or will our practices, law and community norms prove resilient and drive the evolution of these tools?"

The legal implications of large language models, such as OpenAI's Codex, on which Copilot is based, and of text-to-image models trained on datasets compiled by German non-profit LAION, such as Stable Diffusion and Imagen, remain heated topics of discussion. Similar concerns have been raised about the images generated by Midjourney.

Asked whether he believes large language models (LLMs) focused on generating source code are more prone to copyright violations because of the constrained nature of their output, Butterick said he's reluctant to generalize.

"We've also been looking into the image generators – users have already found that DALL-E and Midjourney and Stable Diffusion have different strengths and weaknesses. The same will likely be true for LLMs for coding," he said.

"These questions about Copilot have been raised since it was first available in beta. There are probably some legal questions that will end up being common to all these systems, especially around the handling of training data. Again, we're not the first people to raise these. One big difference between open-source code and images is that images are usually offered under licenses that are more restrictive than open-source licenses."

There are also adjacent social and ethical issues that remain unresolved, such as whether AI-generated code should be considered plagiarism and to what extent creators of the materials used to train a neural network should have a say in that AI model's usage.

In the Texas Law Review in March 2021, Mark Lemley, a Stanford law professor, and Bryan Casey, then a lecturer in law at Stanford, posed a question: "Will copyright law allow robots to learn?" They argued that, at least in the United States, it should.

"[Machine learning] systems should generally be able to use databases for training, whether or not the contents of that database are copyrighted," they wrote, adding that copyright law isn't the right tool to regulate abuses.

But when it comes to the output of these models – the code suggestions automatically made by the likes of Copilot – the potential for the copyright claim proposed by Butterick looks stronger.

"I actually think there's a decent chance there is a good copyright claim," said Tyler Ochoa, a professor in the law department at Santa Clara University in California, in a phone interview with The Register.

I actually think there's a decent chance there is a good copyright claim

In terms of the ingestion of publicly accessible code, Ochoa said, there may be software license violations, but that is probably protected by fair use. While there hasn't been much litigation on the point, a number of scholars have taken that position, and he said he's inclined to agree.

Kuhn is less willing to set aside how Copilot deals with software licenses.

"What Microsoft's GitHub has done in this process is absolutely unconscionable," he said. "Without discussion, consent, or engagement with the FOSS community, they have declared that they know better than the courts and our laws about what is or is not permissible under a FOSS license. They have completely ignored the attribution clauses of all FOSS licenses, and, more importantly, the more freedom-protecting requirements of copyleft licenses."

But in terms of where Copilot may be vulnerable to a copyright claim, Ochoa believes LLMs that output source code – more so than models that generate images – are likely to echo training data. That may be problematic for GitHub.

"When you're trying to output code, source code, I think you have a very high likelihood that the code that you output is going to look like one or more of the inputs, because the whole point of code is to achieve something functional," he said. "Once something works well, lots of other people are going to repeat it."

Ochoa argues the output is likely to be the same as the training data for one of two reasons: "One is there's only one good way to do it. And the other is [you're] copying basically an open source solution.

"If there's only one good way to do it, OK, then that's probably not eligible for copyright. But chances are that there's just a lot of code in [the training data] that has used the same open source solution, and that the output is going to look very similar to that. And that's just copying."

In other words, the model may suggest code for a problem that really has only one practical solution, or it may be copying someone's open source code that does the same thing. In either case, a lot of people have probably used the same code, so it shows up repeatedly in the training data, and the assistant ends up regurgitating it.
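
To make Ochoa's point concrete, consider the kind of routine involved in the Davis example. Below is a minimal, generic sketch of a compressed-sparse-column (CSC) matrix transpose in the standard counting-sort style found in textbooks and countless public repositories. It is not Davis's copyrighted code, and the function name and signature are our own illustration; the point is that once a routine like this settles into a canonical form, a model trained on public code is likely to reproduce that form almost verbatim when prompted.

    /* Illustrative sketch only: a generic CSC transpose in the textbook
     * counting-sort style. Not any particular author's copyrighted code;
     * the name and signature are invented for this example. */
    #include <stdlib.h>

    /* Transpose an m-by-n CSC matrix (Ap, Ai, Ax) into (Cp, Ci, Cx).
     * Ap has n+1 entries; Cp must have room for m+1 entries.
     * Returns 0 on success, -1 if workspace allocation fails. */
    int csc_transpose(int m, int n, const int *Ap, const int *Ai,
                      const double *Ax, int *Cp, int *Ci, double *Cx)
    {
        int *w = calloc(m, sizeof(int));            /* per-row counts */
        if (!w) return -1;

        for (int p = 0; p < Ap[n]; p++)
            w[Ai[p]]++;                             /* count entries per row */

        Cp[0] = 0;                                  /* cumulative sum -> Cp */
        for (int i = 0; i < m; i++) {
            Cp[i + 1] = Cp[i] + w[i];
            w[i] = Cp[i];                           /* next free slot per row */
        }

        for (int j = 0; j < n; j++) {               /* scatter column j */
            for (int p = Ap[j]; p < Ap[j + 1]; p++) {
                int q = w[Ai[p]]++;
                Ci[q] = j;                          /* row index in transpose */
                Cx[q] = Ax[p];
            }
        }

        free(w);
        return 0;
    }

Nearly every competent implementation of this operation looks like some variation of the above, which is exactly why, as Ochoa argues, a code-generating model's output can end up indistinguishable from specific entries in its training set.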

Would that be fair use? It's not clear. Ochoa says the functional nature of the code means that reproducing it in a suggestion may not be seen as particularly transformative, which is one of the criteria for determining fair use. And then there's the issue of whether the copying harms the market when the market is not charging for open source code. If it harms the market, fair use may not apply.

"The problem here is the market doesn't charge you money for these uses," said Ochoa, adding though that the terms of the open source licenses are what the market is most interested in. "If a court thinks those conditions are important, then they'll say, 'yeah, you're harming the market for these works, because you're not complying with the conditions.' [The software creators are] not getting the consideration that they wanted when they created these words in the first place.

"So they're not seeking monetary compensation. They're seeking non-monetary compensation. And they're not getting it. And if they're not getting it, then they're going to be less likely to contribute open source code in the future. In theory, that's harming the market for these works or harming the incentive to produce them."

The generated code thus may not be transformative enough to be fair use, and may harm the market as described – again, potentially derailing a fair use claim.

When Berkeley Artificial Intelligence Research considered this issue back in 2020, the group suggested that perhaps training large language models from public web data is fundamentally flawed, given concerns about privacy, bias, and the law. They proposed that tech companies invest in collecting better training data rather than hoovering up the web. That doesn't appear to have happened.

Kuhn argues the status quo must not stand and adds that the SFC has been discussing Microsoft's GitHub with its litigation counsel for a year now.

"We are at a crossroads in our culture, which was in many ways predicted by science fiction," he said.

"Big Tech companies, in all sorts of ways, are seeking to force upon us their preferred conclusions about the applications of artificial intelligence – regardless of what the law says or what values the community of users, consumers, and developers holds. FOSS, and the inappropriate exploitation of FOSS by Microsoft's GitHub, is just one way of doing this among many. We have to stand up to Big Tech's behavior here, and we plan to."

Asked what the ideal outcome would be, Butterick replied that it's too soon to say.

"There's so much we don't know about how Copilot works," he wrote.

"Certainly, we can imagine versions of Copilot that are friendlier to the rights & interests of open-source developers. As it stands, it's potentially an existential threat to open source.

"Obviously, it's ironic that GitHub, a company that built its reputation and market value on its deep ties to the open source community, would release a product that monetizes open source in a way that damages the community. On the other hand, considering Microsoft’s long history of antagonism toward open source, maybe it's not so surprising. When Microsoft bought GitHub in 2018, a lot of open source developers – me included – hoped for the best. Apparently that hope was misplaced." ®
