Opinion Non-techies have discovered AI, and they're in a tizzy. A lot of that tizzy is about how AI is going to outsmart us in some sort of Hollywood dystopia, a fear as deeply ironic as it is deeply wrong. The nature of LLMs isn't HAL 9000 self-awareness, but a giant predictive text machine. That in itself straddles science and science fiction: sufficient awareness of rules and distributions provides what science fiction calls working precognition and what physics calls a working model.
Once you get past the HAL 9000 fixation, it's clear that AIs are best tested not by psychological probing but by analyzing their output through probabilities. That's an approach already bearing fruit, with a group from UC Berkeley learning more about OpenAI's products than OpenAI is making public.
LLMs decide what to output depending on the rules and distributions learned from training data. Thus, the researchers argue, by looking at what an LLM actually produces, you can make inferences about what data it's been fed, especially if you can test it against data you already know – in this case, copyrighted works.
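The principle scales down to a toy you can run yourself. The following is a minimal sketch, not the Berkeley researchers' actual method: a bigram "predictive text machine" trained on a small passage. When the model completes a prompt with suspiciously high confidence, that is evidence the passage was in its training data – the same inference, writ very small. The corpus string and function names here are illustrative inventions.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count which word follows which in the training text."""
    words = text.split()
    model = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        model[a][b] += 1
    return model

def next_word_confidence(model, word):
    """Most likely next word, and the probability the model assigns it."""
    followers = model.get(word)
    if not followers:
        return None, 0.0
    best, count = followers.most_common(1)[0]
    return best, count / sum(followers.values())

# A passage we suspect was in the training diet
corpus = "the boy wizard raised the wand and the boy wizard smiled"
model = train_bigrams(corpus)

word, p = next_word_confidence(model, "boy")
print(word, p)  # 'wizard' with probability 1.0: a telltale memorization spike
```

A word the corpus never disambiguates ("the" is followed by "boy" twice and "wand" once) draws a hedged prediction; a word seen in only one context draws a certain one. Real membership-inference work on LLMs plays the same game against far larger distributions.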
That the scientists discovered OpenAI's models had been fed a diet with a preponderance of science fiction and fantasy is delightful, ironic, and the least surprising thing since a Saturday morning hangover. If God created man in his own image, whoever created ChatGPT was a science fiction and fantasy-obsessed ubergeek with hyperfocus and a neurodiverse approach to empathy. You can throw a stone in the Valley without hitting one of those, but you'll have to throw it very hard towards the ocean.
The researchers reach good and admirable conclusions about open data sets and avoiding bias, but also touch on what may be LLMs' first actual explosive point of contact with the world of humans: copyright.
Normally, if you feed a book or a musical record into a computer, you make a copy. If the original is copyrighted, the rules are simple: you can only do what the copyright owner lets you. But training a neural network doesn't create a permanent copy; it creates a mathematical set of connections and weights, intermingled with those created from other data. It's analysis and synthesis, something we not only allow in humans but force people to do for a decade of their young lives in the forced education camps we call schools. If, on the other hand, using copyright works as training data is against the law, we're all in deep trouble.
Conversely, if humans learn something verbatim and then make money from recreating it, then copyright law once again applies: an actor can't tour a play in copyright with impunity just because they've learned their lines. Things get really murky with derivative work, where something based on a copyright work also needs that copyright holder's permission. And every bit of an LLM's output is derivative of its training data: there is nothing else it can be. This is not only unarguable, it's culturally determined: if you want to see derivative works copyright law flouted on a galactic scale in the real world, go to a con.
Fandom has given us cosplay, fanfic, tribute shows, trademark appropriation, and an obsession with sharing every last quark of a franchise or favored work. Now and again, if something approaches commercial escape velocity, copyright holders may step in, but as a wholesale, global and public display of scofflawyer behavior, science fiction and fantasy fandom is getting away with it. The world is a far better place thereby. It's not that ChatGPT is science fiction that matters, it's that it's a science fiction fan.
LLMs work like a general-purpose fandom, creating new ideas clearly derived from analysis of, among other things, copyright works. How will copyright law react to this? The irony is that LLMs work by probabilities rather than algorithms – and so does copyright. Derivative work has no hard and fast rules over how much derivation counts as making a work derivative. Like its intellectual property sibling, fair use, it rests on general principles mostly derived from case law, but there are more gray areas than a foggy day in San Francisco. If you need more irony here, consider case law itself: a multi-century exercise in derivative work in an arena where copyright does not apply.
What LLMs bring to the fight is massive deployment. As fandom has found, if enough people do it, it gets done. Audio and video home taping both saw immense pushback by the recording industry – in Sony's case, setting different divisions at war with each other – but put a cheap and popular technology in the hands of billions and it wins.
A lot of people who think about such things are awaiting the court cases that will help define the future of how copyright interacts with AI. These are unlikely to help. Copyright gets less useful and more harmful the further it gets away from dealing with actual copies. LLMs are as generative as they are derivative, and copyright law is just terrible at patrolling generative systems, where non-human entities generate novel work. If a gorilla takes a photo in a forest, does a lawyer get paid? If an AI writes a story about a boy wizard, and a billionth of its training data came from Harry Potter, who does JK Rowling sue? Hard cases make bad law, and these are going to be very hard cases indeed.
Like the sea defenses on an eroding coastline, the concepts of derivative work can either be protected at ruinous expense against the rising levels of automated non-compliance or stage a managed retreat on the principle of least damage. Either way, the robots, like the sea, will win in the end. There will be a new landscape. And it will be perfectly habitable. ®