GitHub accused of varying Copilot output to avoid copyright allegations
Copilot code-cloning case clarifies claims
GitHub is alleged to have tuned its Copilot programming assistant to generate slight variations of ingested training code to prevent output from being flagged as a direct copy of licensed software.
This assertion appeared on Thursday in the amended complaint [PDF] against Microsoft, GitHub, and OpenAI over Copilot's documented penchant for reproducing developers' publicly posted, open source licensed code.
The lawsuit, initially filed last November on behalf of four unidentified ("J. Doe") plaintiffs, claims that Copilot – a code suggestion tool built from OpenAI's Codex model and commercialized by Microsoft's GitHub – was trained on publicly posted code in a way that violates copyright law and software licensing requirements and that it presents other people's code as its own.
Microsoft, GitHub, and OpenAI tried to have the case dismissed, but managed only to shake off some of the claims. The judge left intact the major copyright and licensing issues, and allowed the plaintiffs to refile several other claims with more details.
The amended complaint – now covering eight counts instead of twelve – retains accusations of violating the Digital Millennium Copyright Act, breach of contract (open source license violations), unjust enrichment, and unfair competition.
It adds several other allegations in place of those sent back for revision: breach of contract (selling licensed materials in violation of GitHub's policies), intentional interference with prospective economic relations, and negligent interference with prospective economic relations.
The revised complaint adds one additional "J. Doe" plaintiff whose code Copilot has allegedly reproduced. And it includes sample code written by the plaintiffs that Copilot has supposedly reproduced verbatim – though only for the court's eyes: the samples are redacted in the public filing to prevent the plaintiffs from being identified.
The judge overseeing the case has permitted the plaintiffs to remain anonymous in court filings because of credible threats of violence [PDF] directed at their attorney. The Register understands that the plaintiffs are known to the defendants.
A cunning plan?
Thursday's legal filing says that in July 2022, in response to public criticism of Copilot, GitHub introduced a user-adjustable Copilot filter called "Suggestions matching public code," intended to suppress suggestions that duplicate other people's work.
"When the filter is enabled, GitHub Copilot checks code suggestions with their surrounding code of about 150 characters against public code on GitHub," GitHub's documentation explains. "If there is a match or near match, the suggestion will not be shown to you."
However, the complaint contends the filter is essentially worthless because it only checks for exact matches and does nothing to detect output that has been slightly modified. In fact, the plaintiffs suggest that GitHub is trying to get away with copyright and license violations by varying Copilot's output so that it doesn't appear to have been copied exactly.
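The complaint's argument, in essence, is that an exact-match check is trivially defeated by cosmetic edits. A minimal sketch of that idea – assuming a naive substring comparison, which is purely illustrative and not GitHub's actual implementation – might look like this:

```python
# Hypothetical sketch of a verbatim-only filter, and why a near-copy
# evades it. This is NOT GitHub's actual filter logic.

LICENSED_SNIPPET = "def wrap(text, width):\n    return join_lines(text, width)"

def verbatim_filter(suggestion: str, public_corpus: list[str]) -> bool:
    """Return True if the suggestion should be blocked: it appears
    character-for-character inside some known public code."""
    return any(suggestion in known for known in public_corpus)

corpus = [LICENSED_SNIPPET]

# A verbatim copy is caught by the exact-match check...
assert verbatim_filter(LICENSED_SNIPPET, corpus)

# ...but renaming a single identifier -- functionally the same code --
# no longer matches, so it slips through unblocked.
variant = LICENSED_SNIPPET.replace("text", "txt")
assert not verbatim_filter(variant, corpus)
```

The same evasion works for any purely cosmetic change – whitespace, comments, variable names – which is exactly the class of variation the plaintiffs allege Copilot produces.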
"In GitHub’s hands, the propensity for small cosmetic variations in Copilot’s Output is a feature, not a bug," the amended complaint says. "These small cosmetic variations mean that GitHub can deliver to Copilot customers unlimited modified copies of Licensed Materials without ever triggering Copilot’s verbatim-code filter."
The court filing points out that generative models like the one behind Copilot expose a sampling parameter, known as temperature, that controls the extent to which output varies.
"On information and belief, GitHub has optimized the temperature setting of Copilot to produce small cosmetic variations of the Licensed Materials as often as possible, so that GitHub can deliver code to Copilot users that works the same way as verbatim code, while claiming that Copilot only produces verbatim code one percent of the time," the amended complaint says. "Copilot is an ingenious method of software piracy."
In an email, Microsoft's GitHub insisted otherwise.
"We firmly believe AI will transform the way the world builds software, leading to increased productivity and most importantly, happier developers," a company spokesperson told The Register. "We are confident that Copilot adheres to applicable laws and we’ve been committed to innovating responsibly with Copilot from the start. We will continue to invest in and advocate for the AI-powered developer experience of the future."
OpenAI did not respond to a request for comment. ®