How to spot OpenAI's crawler bot and stop it slurping sites for training data

Aww, c'mon, let us scrape your pages, we've got billions at stake

OpenAI, the maker of machine learning models trained on public web data, has published the specifications for its web crawler so that publishers and site owners can opt out of having their content scraped.

The newly released technical document describes how to identify OpenAI's web crawler GPTBot through its user agent token and string, which get emitted by the company's software in the HTTP request header sent to ask a server for a web page.
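As a rough illustration of the server-side check this implies, here is a minimal Python sketch that looks for the GPTBot token in a request's User-Agent header. The function name `is_gptbot` and the sample header values are hypothetical, not taken from OpenAI's documentation, and since any client can send whatever User-Agent it likes, a match is a hint rather than proof.

```python
def is_gptbot(headers: dict[str, str]) -> bool:
    """Return True if the User-Agent header contains the GPTBot token."""
    # HTTP header names are case-insensitive, so compare them lowercased.
    user_agent = next(
        (value for name, value in headers.items() if name.lower() == "user-agent"),
        "",
    )
    # Match the token case-insensitively anywhere in the string.
    return "gptbot" in user_agent.lower()


# Hypothetical request headers, for illustration only.
crawler = {"User-Agent": "Mozilla/5.0 (compatible; GPTBot/1.0)"}
browser = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"}

print(is_gptbot(crawler))  # True
print(is_gptbot(browser))  # False
```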

Web publishers can thus add an entry to their web server's robots.txt file to tell the crawler how it should behave, assuming GPTBot was designed to heed the Robots Exclusion Protocol, which not all bots do. For example, the following pair of robots.txt directives would instruct GPTBot to stay out of the root directory and everything beneath it, which is to say the entire site.

User-agent: GPTBot
Disallow: /
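Site owners who want to sanity-check such an entry before deploying it could, as one option, run it through Python's standard-library `urllib.robotparser`. This is only a sketch: the example.com URL and the `SomeOtherBot` name are placeholders, and the result shows how a compliant parser reads the rules, not how GPTBot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The two directives shown above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL; any path on the site falls under "Disallow: /".
url = "https://example.com/articles/1"
gptbot_allowed = parser.can_fetch("GPTBot", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)

print(gptbot_allowed)  # False: GPTBot is barred from the whole site
print(other_allowed)   # True: no rule applies to other crawlers
```

Because the entry names only GPTBot, other crawlers remain unaffected unless they get rules of their own.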

However, OpenAI insists that allowing its bot to collect site data can improve the quality of the AI models the biz builds, and that scraping can be done without gathering sensitive information, the practice over which OpenAI and Microsoft were recently sued.

"Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies," the ML super-lab's documentation reads.

"Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety."

And who wouldn't want to save OpenAI the time and expense of making its models more capable and less risky?

Even so, OpenAI's acknowledgement that it trains its large language models on the public internet has coincided with efforts by organizations to limit automated access to information via the web. AI software makers enjoy grabbing all kinds of info from sites to train their models and so bank millions if not billions of dollars in revenue. Some businesses are putting their foot down and closing off access if they're not going to get a cut of that income.

Reddit, for example, recently changed its API terms to better enable the company to monetize the content created free-of-charge by its users. And Twitter recently sued four unidentified entities to prevent site data from being scraped for AI training.

Unleash the legal eagles!

OpenAI did not immediately respond to a request to explain why it published details about GPTBot. But it may not be a coincidence that there have been several recent lawsuits filed against the Microsoft-championed biz for allegedly using publicly accessible data without consent, or in contravention of stated licensing terms.

Beyond the privacy lawsuit noted above, OpenAI, Microsoft, and the latter's GitHub subsidiary were sued in November for allegedly ingesting license-encumbered source code to train OpenAI's Codex model, and then reproducing that code through GitHub's Copilot source-suggestion service. Several book authors last month filed a similar lawsuit alleging OpenAI trained ChatGPT on their work without permission.

Google, DeepMind, and parent Alphabet have also been sued over similar claims.

Given the legal uncertainty arising from scraping public data and using that info to train AI models, it's perhaps unsurprising that Google – an OpenAI rival – last month proposed rethinking how the Robots Exclusion Protocol works.

Israel Krush, CEO and co-founder of Hyro, which makes an AI assistant for the healthcare industry, told The Register there are two main issues with the way web crawling works.

"Firstly, the default setup involves publishers having to actively opt out if they don't want their websites to be crawled and used for fine-tuning," he said. "This process is quite different from how search engines operate, where crawling serves as a reference to direct users to the publishers' sites.

"With OpenAI and AI assistants, the content becomes a direct part of the product, which could sometimes lead to inaccuracies. The fact that publishers have to opt out raises a big concern."

Krush said integrating this content into someone else's product, and potentially changing it along the way, raises another issue.

"The second problem is with OpenAI's statement about excluding websites 'known for using Personally Identifiable Information (PII),'" he said. "This statement is a bit puzzling."

"Take news publishers, for instance; they naturally include some identifiable information. Even websites that aren't specifically thought of as holding PII might still have some. Any content involving PII needs to be properly redacted."

Krush argued that compliance concerns and responsible model use require stronger safeguards, noting that his own firm only scrapes data with explicit permission and handles personal information appropriately.

"Instead of just focusing on scraping websites already flagged for PII, OpenAI should assume there's potential for PII across all sites, particularly with publishers," he said. "They should take proactive steps to make sure the scraped info aligns with compliance rules." ®
