Reddit to Perplexity: Get your filthy hands off our forums
Social media site continues legal campaign against those who take its content without a license
Reddit on Wednesday filed a lawsuit against Perplexity AI and three of its alleged data dealers for trafficking in unlawfully scraped information.
The complaint, filed in the Southern District of New York, claims that Oxylabs UAB, AWM Proxy, and SerpApi unlawfully bypassed Reddit's and Google's defenses to harvest Reddit content and related search results. It also says that Perplexity chose to purchase the purloined data rather than license it from Reddit.
Ben Lee, chief legal officer at Reddit, told The Register in an emailed statement that AI companies are desperate for quality content generated by real people and that this need is fueling an industrial-scale data laundering economy.
"Scrapers bypass technological protections to steal data, then sell it to clients hungry for training material," said Lee. "Reddit is a prime target because it's one of the largest and most dynamic collections of human conversation ever created."
Lee claimed that Oxylabs UAB, a data scraping business based in Lithuania, AWM Proxy, a former Russian botnet, and SerpApi, which advertises real-time access to scraped Google search results, represent textbook examples of this sort of illegal behavior.
"Unable to scrape Reddit directly, they mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search," said Lee. "Perplexity is a willing customer of at least one of these scrapers, choosing to buy stolen data rather than enter into a lawful agreement with Reddit itself."
Reddit's complaint likens these three providers to "would-be bank robbers, who, knowing they cannot get into the bank vault, break into the armored truck carrying the cash instead." Echoing Cloudflare CEO Matthew Prince's characterization of Perplexity, the Reddit legal filing describes Perplexity as "more akin to a 'North Korean hacker'" who will do whatever is necessary to obtain the data to fuel its AI answer engine, other than pay for a license.
Google is not participating in the lawsuit but has tried to prevent automated scraping of its search results.
The social media site contends that the defendants have violated the US Digital Millennium Copyright Act by bypassing its technological defenses against automated access to its servers. And it accuses SerpApi and Oxylabs specifically of violating the DMCA's prohibition on trafficking in technology circumvention products or services. Other claims include unfair competition, unjust enrichment, and civil conspiracy.
Reddit is seeking damages and an injunction to halt the unwanted scraping of its content.
In June, Reddit filed a similar complaint against Anthropic after it failed to convince the AI business to enter into a content licensing deal as OpenAI has done.
Oxylabs, which advertises itself as "the largest ethical proxy network and advanced scraping solutions empowering the AI industry and beyond," did not immediately respond to a request for comment.
“It doesn’t appear we have received any communication or service from Reddit on this,” said Ryan Schafer, customer service success director at SerpApi, in an email to The Register. “We strongly disagree with Reddit’s allegations and intend to vigorously defend ourselves in court. We don’t have further comments at the moment.”
A spokesperson for Perplexity told The Register, "Perplexity has not yet received the lawsuit, but we will always fight vigorously for users' rights to freely and fairly access public knowledge. Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest."
Reddit is not alone in its attempts to defend against its content being scraped and used to train AI models without consent. A lawsuit [PDF] filed last month on behalf of two authors accuses Apple of "using Books3, a dataset of pirated copyrighted books" to train its OpenELM language models. The complaint against Apple says that the company's AppleBot has been scraping web data for nine years and that data is now being used to improve Apple Intelligence models.
Another case, Millette v. OpenAI (2024), contends that OpenAI scraped YouTube videos unlawfully to improve its models. The New York Times Co. v. Microsoft Corp., OpenAI (2023) makes similar allegations with regard to Microsoft's and OpenAI's alleged use of its news content.
In August, content delivery network Cloudflare called out Perplexity for running web scraping bots that ignore websites' no-scraping directives. ®
Updated at 2000 UTC with comment from SerpApi.
Updated on October 23 at 1452 UTC to add:
In an emailed statement received after this story was filed, Denas Grybauskas, chief governance and strategy officer at Oxylabs, said, "Even though we haven't been served yet, we've read about Reddit's lawsuit naming Oxylabs, along with three unrelated and unaffiliated companies. We are shocked and disappointed by this news, as Reddit has made no attempt to speak with us directly or communicate any potential concerns. Oxylabs has always been and will continue to be a pioneer and an industry leader in public data collection, and it will not hesitate to defend itself against these allegations.
"Oxylabs' position is that no company should claim ownership of public data that does not belong to them."
Grybauskas said Oxylabs creates real value for thousands of businesses, such as those pursuing open source investigations, and that the company demands that businesses use its services lawfully.