Whose line is it anyway, GitHub? Innovation, not litigation, should answer
If Jesus was my Copilot, what would he do?
Opinion Open source. It's open. You can look. Mostly, you can use. There's a clue in the name. Not so fast, claims a class action brought against Microsoft, OpenAI and GitHub. Copilot, an in-IDE, AI-powered suggestion bot trained on open source, works by offering lines of code to programmers - and that, the class action suit alleges, breaks the rules, with some sneakiness deployed to hide it. A judge has ruled that some of the claims deserve their day in court. Dear lord, not another copyright battle.
Technology can look very odd to judges. Say you legally purchase an ebook. How do you get it? Routers and caching servers each make copies of the book as it's delivered, but they haven't paid a penny. Are the owners of internet infrastructure breaking copyright billions of times a day? You might think that's a daft question, but it bothered the UK's Supreme Court enough to go to Europe to ask "Is this Internet actually legal?" Don't be so bloody daft, came the reply. We miss Europe.
How many of the claims against Microsoft, GitHub and OpenAI over their code prompter will fall into the bloody daft box remains to be seen. Nobody foresaw AI ingesting global databases of open source code when the rules were written. Then again, nobody foresaw search engines doing wholesale ingestion, analysis and presentation of all that content. That certainly has its problems, but the consensus is that it's too useful and not damaging enough to outlaw. Copilot and other machine learning systems that feed on Internet content are much the same as search engines in that respect. So the question is: is the result not useful enough, or too damaging, to accept? Where's the balance of interests?
There are useful ways to approach the issues, and they involve - corporate management look away now - ethics. Yes, really, that briefly fashionable chatter about ethical AI offers a concrete way forward that will work a lot better than lawsuits.
Bent out of shape as it is by special interests, the heart of intellectual property law is that the creator's reasonable wishes should be respected. If software is open source, then the creator reasonably wishes people to be able to read it and put it to use. Something that encourages this doesn't seem the worst sin in the world.
Perhaps it's the way it does it, presenting the code suggestions out of context. There are lots of open source licenses, after all, and some may contain conditions that our happy Copilot cut-and-paster should know about. Well, assuming Copilot can recognize when it's suggesting someone else's code, it's not unreasonable to expect it to report the licensing conditions that code is offered under. That puts the onus on the coder to comply, which is more ethical than offering up temptation while hiding the consequences. Might even improve the hit rate for following open source rules.
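It needn't be complicated, either. As a purely hypothetical sketch - the names, fields and repository below are illustrative, not anything GitHub actually exposes - a license-aware suggestion only has to carry a little provenance along with the code:

    # Hypothetical shape for a license-aware suggestion; field names and the
    # example repo are made up for illustration, not Copilot's real machinery.
    from dataclasses import dataclass

    @dataclass
    class Suggestion:
        code: str          # the proposed lines
        source_repo: str   # where a near-identical snippet was found, if anywhere
        license_id: str    # SPDX identifier of that repo's license, e.g. "GPL-3.0-only"

    def present(s: Suggestion) -> str:
        """Show the suggestion with its provenance, leaving compliance to the coder."""
        notice = f"# Matches code in {s.source_repo} (licensed {s.license_id}) - check before use\n"
        return notice + s.code

    print(present(Suggestion(
        code="def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n",
        source_repo="example.org/some/repo",
        license_id="MIT",
    )))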
What if the original coder really doesn't want their stuff squeezed through the bowels of Copilot? The search engine world tackled that by inventing robots.txt. Put a file of that name in your web root directory, and you're putting up a "No Entry" sign for web crawlers. Things are a bit more advanced these days, so building that sort of function into the fabric of GitHub, with whatever fine-tuning best expresses creator intent, would be nice. In any case, telling content providers: "You don't want your stuff in our search results? Fine." has tended to focus minds on ways to live with it. Giving people choices while explaining the consequences? Nice.
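For the record, the original "No Entry" sign amounts to a few lines of plain text. A repository-level equivalent for training bots could be just as terse, whatever name and knobs GitHub chose to give it:

    # robots.txt - placed in the web root, read by well-behaved crawlers
    # Tell every crawler to stay out of everything:
    User-agent: *
    Disallow: /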
Even if giving people the right to remove their code from Copilot and the like results in a ton of good stuff going away, that's not the end of the world. There's the "cleanroom principle", which smashed IBM's dominant position in the 1980s while accelerating the market like crazy. This is something machine learning could learn a lot from.
The original IBM PC was almost entirely open source. IBM published a technical manual with full circuit diagrams, all using standard chips connected in standard ways that the chipmakers gave away for free. Designing a functionally equivalent (yet non-copyright) IBM PC clone was something thousands of electronic engineers could do, and hundreds did.
The legal landmine in the beige box was the BIOS, or Basic Input/Output System, a relatively small chunk of permanent software that provided a standard set of hardware services to operating systems and applications through interrupts - what would be called an API today. If you just copied that code for your clone, IBM would have you bang to rights. You could rewrite the code, but IBM could then tie you up in lawsuits making you prove you didn't copy any of it. Even if you won, the delay and expense would sink you.
Cue the cleanroom. Cloners hired coders who'd never read a line of IBM's BIOS, and forbade them from doing so. These programmers were given the API, which was not covered by copyright, and told to write to that spec. With legal attestations the cloners were happy to swear to in court, the principle that you cannot copy what you haven't seen held - and the last piece of the jigsaw in the original Clone Wars was in place. That APIs provide such a powerful antidote to copyright has led many to try to change their legal status, most recently in Google v Oracle. That one ended up in the US Supreme Court where it, like all the others, failed.
So, take two automated systems, one dedicated to finding and isolating interfaces within code, and one dedicated to applying rules to generate code that provides those interfaces. There's no transfer of lines of code across the virtual air gap. Automated testing of original versus AI code would increase quality. En passant, a very fine set of tools for refactoring would be born, to the benefit of all. Sounds ethical, right?
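To make the idea concrete, here's a minimal sketch in Python - an assumption-laden toy, not a description of any real product - in which the "dirty" side reads existing code but passes only names, arguments and docstrings across the gap, and the "clean" side writes fresh code from that spec alone:

    # Toy sketch of a "virtual cleanroom": stage one extracts only the interface,
    # stage two generates new code from that spec, never seeing the original body.
    import ast

    ORIGINAL = '''
    def checksum(data):
        """Return the sum of all bytes, modulo 256."""
        total = 0
        for b in data:
            total = (total + b) % 256
        return total
    '''

    def extract_interface(source):
        """Dirty side: keep only names, arguments and docstrings - never the body."""
        spec = []
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.FunctionDef):
                spec.append({
                    "name": node.name,
                    "args": [a.arg for a in node.args.args],
                    "doc": ast.get_docstring(node) or "",
                })
        return spec

    def generate_from_spec(spec):
        """Clean side: write fresh code from the spec alone, across the air gap."""
        out = []
        for fn in spec:
            args = ", ".join(fn["args"])
            out.append(f'def {fn["name"]}({args}):\n    """{fn["doc"]}"""\n'
                       f'    raise NotImplementedError  # clean-side implementation goes here')
        return "\n\n".join(out)

    if __name__ == "__main__":
        import textwrap
        interface = extract_interface(textwrap.dedent(ORIGINAL))  # dirty side sees the original
        print(generate_from_spec(interface))                      # clean side sees only the spec

Automated tests run against both versions would then do the job the cloners' attestations did: prove the new code behaves the same without having copied anything.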
There we have it. If there are genuine problems with what Copilot is doing, then there are multiple ways to avoid them while preserving utility and creating new benefits. Playing by the rules while making things better? That's a good line to take. ®