Medium asks AI bot crawlers: Please, please don't scrape bloggers' musings
OpenAI and Google might respect robots.txt but how about the others?
Blogging platform Medium would like organizations to not scrape its articles without permission to train up AI models, and warned this policy may be difficult to enforce.
CEO Tony Stubblebine on Thursday explained how Medium intends to curb the harvesting of people's written work by developers seeking to build training data sets for neural networks. He said, above all, devs should ask for consent – and offer credit and compensation to writers – before training large language models on people's prose.
Those AI models can end up aping the writers they were trained on, which feels to some like a double injustice: the scribes weren't compensated in the first place, and now models are threatening to take their place as well as income derived from their work.
"To give a blunt summary of the status quo: AI companies have leached value from writers in order to spam internet readers," he wrote in a blog post. "Medium is changing our policy on AI training. The default answer is now: no."
Medium has thus updated its websites' robots.txt file to ask OpenAI's web crawler bot GPTBot to not copy content from its pages. Other publishers – such as CNN, Reuters, the Chicago Tribune, and the New York Times – have already done this.
Stubblebine called this a "soft block" on AI: it relies on GPTBot heeding the request in robots.txt to not access Medium's pages and lift the content. But other crawlers can and may ignore it. Medium could wait for those crawlers to provide a way to block them via robots.txt, and update its file accordingly, but there's no guarantee that will ever happen.
For what it's worth, though, not only does OpenAI support blocking via robots.txt, so too does Google, which also on Thursday detailed how to block, again via robots.txt, the crawling it does to gather training data for its Bard and Vertex AI generative services. Medium has yet to update its robots.txt to exclude Google's AI training spiders.
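For illustration, the opt-out amounts to a few lines in robots.txt. The user-agent tokens below are the ones OpenAI (GPTBot) and Google (Google-Extended) have published for this purpose:

```
# Ask OpenAI's training crawler to stay away from the whole site
User-agent: GPTBot
Disallow: /

# Google-Extended controls use of content for Bard and Vertex AI training;
# it does not affect normal Googlebot search indexing
User-agent: Google-Extended
Disallow: /
```

As Stubblebine notes, these directives are requests, not enforcement: a crawler that doesn't check the file, or chooses to ignore it, is unaffected.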
Blocking web crawlers at a level lower than robots.txt, such as by IP address or user agent string, will work, too – until the bots get new IP addresses or alter their user agent strings. It's a game of whack-a-mole that may be too tedious to play.
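As a sketch of what that lower-level whack-a-mole might look like, a site could reject requests by user-agent string. The Python helper below is purely illustrative – the crawler list is an assumption that would need constant upkeep, which is exactly the tedium described above:

```python
# Hypothetical user-agent filter for known AI-training crawlers.
# The token list is illustrative and works only until a bot
# changes its user-agent string.
AI_CRAWLER_TOKENS = ("gptbot", "ccbot")

def is_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header matches a known AI crawler."""
    ua = user_agent.lower()
    return any(token in ua for token in AI_CRAWLER_TOKENS)

# A server would typically answer matching requests with HTTP 403
print(is_ai_crawler("Mozilla/5.0; compatible; GPTBot/1.0"))  # True
print(is_ai_crawler("Mozilla/5.0 (Windows NT 10.0) Firefox/118.0"))  # False
```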
"Unfortunately, the robots.txt block is limited in major ways," Stubblebine said. "As far as we can tell, OpenAI is the only company providing a way to block the spider they use to find content to train on. We don't think we can block companies other than OpenAI perfectly."
By that he means that at least OpenAI, and now Google, has promised to observe robots.txt. Other orgs collecting data for machine-learning training might just ignore it.
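For a sense of what "observing robots.txt" means in practice: a well-behaved crawler checks the file before fetching anything, for instance via Python's standard urllib.robotparser. The rules string and URL below are illustrative:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules, mirroring Medium's GPTBot block
rules = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler asks before fetching; a rogue one simply skips this step
print(parser.can_fetch("GPTBot", "https://medium.com/some-article"))       # False
print(parser.can_fetch("SomeOtherBot", "https://medium.com/some-article")) # True
```

This is why the block is "soft": compliance happens entirely on the crawler's side.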
That all said, regardless of robots.txt protections, Medium has promised to send cease-and-desist letters to anyone crawling its pages without permission for articles to train models on.
So, effectively: Medium has asked OpenAI's crawler to leave it alone, at least, and will take other data-set crawlers to task with legal threats if they don't back off. The website's terms of service were updated to forbid the use of spiders and other crawlers to scrape articles without Medium's consent, we're told.
Stubblebine also warned writers on the platform that it's not clear whether copyright law can protect them from companies training models on their work and using those models to produce similar or almost identical material, amid multiple ongoing lawsuits over exactly that question.
The CEO also reminded Medium users that no one can resell copies of their work on the site without permission. "In the default license on Medium stories, you retain exclusive right to sell your work," Stubblebine wrote.
He went on to say that some AI developers may have done just that: bought or obtained copies of articles and other works scraped off Medium and other parts of the internet by third-party resellers, to then train networks on that content. He dubbed that laundering of people's copyrighted material "an act of incredible audacity."
Stubblebine advised companies looking to crawl web data from Medium to contact the site to discuss credit and compensation among other sticking points. "I'm saying this because our end goal isn't to block the development of AI. We are opting all of Medium out of AI training sets for now. But we fully expect to opt back in when these protocols are established," he added.
Medium proposed that if an AI maker were to offer compensation for scraped text, the blogging biz would give 100 percent of this to its writers.
In July, it also confirmed that although AI-generated posts aren't completely banned, it would not be recommending any text completely written by machines.
"Medium is not a place for fully AI-generated stories, and 100 percent AI-generated stories will not be eligible for distribution beyond the writer's personal network," it stated. ®