Even robots have the right to learn from open source
Just because it's Microsoft doesn't mean it's wrong
Opinion: If the soap opera of Microsoft's relationship with open source had a theme tune, it'd be "The Long and Winding Goad".
To a company whose entire existence depended on market control, open source's radical freedoms were an existential, cancerous threat. In return, open source was only too happy to play the upstart punk movement to Microsoft's bloated prog rock.
In the end, both sides accepted the inevitable. Redmond wasn't going to control the cloud and mobile the way it controlled business IT, and the cloud and mobile loved open source. Interoperability was more profitable than insults. For its part, open source was, well, open. It couldn't stop Microsoft's newfound friendliness, so wary acceptance became the new world order.
The old animosities are not dead, they only slumber. Many FOSS fiends were deeply suspicious of Microsoft's 2018 acquisition of GitHub, global hive of open-source activity and itself based on the Git version control software written by FOSS patron saint Linus Torvalds. Bad things must surely follow, they said, even if they couldn't quite say what. Now they can. Enter the Software Freedom Conservancy (SFC), jealous guardians of FOSS rights, asking all open-source developers to stop using GitHub for reasons of exploitation.
Put simply, Microsoft has been industriously mining the code in the GitHub repositories and feeding it to an AI to train it in programming. The result, called Copilot, is then sold to programmers as a code suggestion aid. The SFC says that because proper attribution isn't given for the training data that contributes to the output code, and open-source licenses are frequently very hot on proper attribution, this is exploitation and not to be tolerated.
It's an interesting argument that highlights some important unresolved issues in AI and the use of intellectual property. It's also wrong, on ethical, historical, practical, and philosophical grounds.
The immediate unresolved issue is simply that you can't legislate for novel use. Laws about control of use of land didn't foresee the invention of aircraft that technically trespass if they fly overhead. Phonographs didn't work with sheet music IP law, radio didn't work with phonograph IP, the internet didn't work with broadcast radio IP. And AI training data is a novel use of IP that, if it's a problem, will need new developments, either by way of new law or through new practical licensing protocols. If existing open-source licenses had explicit provision for AI training, the SFC's beef would be as beefless as a vegan burger.
Yet we don't need to go there. One of the major benefits of open source is that it is a conduit for ideas. Every FOSS file is a textbook, a teacher, a resource for learning. Knowledge shared is a common good, a force multiplier for the creative intellect. Proprietary software that hides itself away is a rich person's private library, its useful life heavily curtailed. Nobody has ever expected a human programmer, trained through open source, to attribute everything that contributed to their skills, or to repeat that attribution in all the new code they write. If it's not immoral for humans, how can it be for AIs?
Likewise, what are the practicalities? Are we to follow the example of the "where there's a hit, there's a writ" music industry, where a vague similarity between phrases in two songs can keep the courts busy for months? That's going to be plenty tough on millions of coders who are used to cutting and pasting a 10-line routine, changing parameter names to suit, without complying with full license requirements.
Historically, the gripe seems akin to the complaints against innovation that arise every time a machine does a human task. It's the history of tech, from the Luddites sabotaging textile machinery that operated in a "fraudulent and deceitful manner" to the 1970s UK Musicians' Union trying to ban synthesizers because they took work away from skilled musicians. Each case depends on its merits, but the transition always happens. In IT especially, which is entirely about machines taking over human tasks, the deal with the devil has long been done.
Finally, the philosophy that making money out of open-source ideas is somehow wrong in itself is plain nonsense. The FOSS world gives employment to countless workers and is at the heart of countless billions of dollars of economic activity. Even Linus has to eat.
A much better argument is that if Microsoft isn't documenting its training data well enough to identify source files, the training data itself is suspect, and undesirable outputs will be harder to diagnose. But that's a reason to avoid Copilot, not to abstain from GitHub.
It's good to learn there are new ways to use FOSS. It's good to decide how you want this to apply to your own code. It's not good to let old prejudices guide those deliberations. There are untold instances of exploitation in this sorrowful world: robot teachers ain't one. ®