Reddit: If you want to slurp our API to train that LLM, you better pay for it, pal
End of free money era and end of free data for building billion-dollar models
In a move seemingly designed to stop it being used as a free training library for large language models, megaforum Reddit said it's going to begin charging companies that make excessive use of its data-downloading API.
"As a platform with one of the largest corpus of human-to-human conversations online, spanning the past 18 years, we have an obligation to our communities to be stewards of this content," Reddit said.
The Reddit Data API was ostensibly released to help developers build apps and services for Reddit users by allowing access to posts and other info hosted on Reddit. It's also used by academics, researchers and "social listening tools" to get access to Reddit data, the company said, but some people are using it excessively.
By some people, we imagine Reddit means orgs like OpenAI, which for its GPT series has used petabytes of information from Wikipedia, libraries of books, webpages linked to from Reddit posts, and much more.
While not naming companies like Google and OpenAI directly, Reddit CEO and cofounder Steve Huffman told The New York Times in an interview that Reddit "is a home for authentic conversation" online, and as such "the Reddit corpus of data is really valuable" to third parties.
"Crawling Reddit, generating value and not returning any of that value to our users is something we have a problem with," Huffman opined. "It's a good time for us to tighten things up. We think that's fair."
- Predict stocks, foresee public opinion, all kinda possible with ChatGPT-like models
- Adobe: Take user data to train generative AI models? We'd never do that
- Publisher halts AI article assembly line after probe
- Why ChatGPT should be considered a malevolent AI – and be destroyed
As part of the new terms, Reddit said it "reserves the right to charge fees for access and use of Reddit Services and Data, rates to be determined at Reddit's sole discretion." Prohibitions on "access or use [of] the Reddit Services and Data through any means to train large language, artificial intelligence, or other algorithmic models" are also included.
That said, on a Reddit help page covering commercial use and fees for Reddit's developer tools, the site said that use of the site's dev tools (which according to the company includes APIs) for commercial purposes, including "selling access to models trained on Reddit data" is allowed with permission, and presumably payment of an associated fee.
Reddit gave no clue as to what qualifies as the "additional capabilities, higher usage limits, and broader usage rights" that it said would determine who has to pay for Data API access, nor did it say how much such third parties would need to shell out for the privilege.
The company also announced new and updated native moderator tools for the Reddit platform today, including additional mod queues, new rule management features and a mod log.
Legally this could get interesting
Interestingly enough, Reddit also said that it has updated its terms to "further [clarify] that user content is owned by redditors that have created and submitted content on Reddit and cannot be used without permission," which could be a real thorn in the side of anyone seeking to scrape the site for the nearly two decades of conversations it contains.
Reddit's user agreement includes carve-outs for its own use of content published by posters, including "the right for us to make your content available [to] other companies, organizations, or individuals who partner with Reddit." This makes it a bit fuzzy as to whether content ownership is an issue if the party that wants access to the data has permission from Reddit.
Regarding that gap, a Reddit spokesperson told us that it would have more information to share in June about how permission will be granted as it rolls out its paid access offering. That's when we'll be told more about pricing, too, the spokesperson said.
When asked what sort of use thresholds developers would be looking at before being asked to pay, Reddit told us that it's always had rate limits in place for its API usage. Reddit didn't bother to tell us what those rate limits are or if they were going to change under the new program, but GitHub documentation last updated in 2015 indicates it's 60 requests per client per minute with no mention of bulk limits.
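For a sense of what staying under that documented ceiling might look like from the developer's side, here is a minimal client-side throttle sketch. It assumes the 60-requests-per-minute figure from the 2015 documentation still applies, which Reddit did not confirm, and the class and method names are hypothetical rather than part of any Reddit library.

```python
import time
from collections import deque


class MinuteRateLimiter:
    """Hypothetical client-side throttle: at most `limit` calls per `window` seconds.

    The default of 60 per 60 seconds mirrors the figure in the 2015
    documentation; it is an assumption, not a confirmed current limit.
    """

    def __init__(self, limit=60, window=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.window = window
        self.clock = clock      # injectable so the limiter can be tested
        self.sleep = sleep
        self.sent = deque()     # timestamps of requests inside the window

    def wait_time(self):
        """Return how many seconds to wait before the next request is allowed."""
        now = self.clock()
        # Forget requests that have aged out of the sliding window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            return 0.0
        # Window is full: wait until the oldest request expires.
        return self.window - (now - self.sent[0])

    def acquire(self):
        """Block until a request slot is free, then record the request."""
        while True:
            delay = self.wait_time()
            if delay <= 0:
                break
            self.sleep(delay)
        self.sent.append(self.clock())
```

A client would call `acquire()` before each API request. A sliding window like this only keeps a well-behaved app under a per-minute cap; it says nothing about whatever bulk or paid-tier thresholds Reddit ends up setting.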
What Reddit's spokesperson did tell us is that the company's never been very good about enforcing API usage limits or "clearing space" for a premium tier with increased limits.
Reddit said developers and third parties will be notified via email of the changes beginning today, and that the new rules will generally go into effect on June 19. The spokesperson we talked to also wanted to make clear that the Data API was still freely accessible for appropriate use cases through the Reddit developer platform; hopefully app developers and other small-scale operators won't have any surprises ahead this summer. ®