4chan and other web sewers scraped up into Google's mega-library for training ML

Are you still so keen to have generative AI write your emails, sales proposals, blog posts ... ?

Problematic, racist, and pornographic web content is seemingly being used to train Google's large language models, despite efforts to filter out that stratum of toxic and harmful text.

An investigation by The Washington Post and the Allen Institute for AI analyzed Google's immense public C4 dataset, released for academic research, to get a better understanding of what types of websites are typically scraped to train large language models.

The C4 dataset was used to train Google's T5 Text-to-Text Transfer Transformer as well as Facebook's Large Language Model Meta AI (LLaMA), a variant of which raised alarm bells.

It appears C4 has ingested concerning material, which is being used to build next-gen machine-learning systems. That could cause those systems to behave inappropriately and unreliably.

Regular Register readers will be aware we've pointed out problems with training datasets over and over, such as the horrible underbelly of a highly cited set curated by MIT.

Latest probe

The Post and Allen Institute's analysts ranked the top 10 million websites included in C4 by matching text that appeared as internet content. Although C4 is a smaller, cleaner version of the Common Crawl dataset, which comprises text from billions of websites, it still contained undesirable material from dark corners of the internet.

Racist, anti-trans, and toxic text was scraped from websites such as the race-hate haven Stormfront, the doxxing forum Kiwi Farms, and toxic message board 4chan. It's therefore unsurprising that language models based on that corpus can generate inappropriate content, talk of conspiracy theories, or bring up dubious ideologies.

C4 also includes websites hosting varying degrees of personal information, such as voter registration databases. Against this backdrop, regulatory agencies in Italy, Canada, Spain, and France have launched investigations into OpenAI's ChatGPT over data privacy concerns, as the model can ingest and generate sensitive information.

Large language models powering AI chatbots are neither intelligent nor conscious, no matter how magic they seem: they write by predicting the flow of words and sentences in response to prompts, questions, and instructions from users or even other bots. This involves drawing upon the mountains of data they've been trained on, and learning from it, to emulate what a person would write.

These predictions therefore reflect patterns in the kinds of text humanity produces, such as internet posts, news articles, poetry, and novels, all of which is vacuumed up into vast training datasets.
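To make "predicting the flow of words" concrete, here is a deliberately tiny sketch in Python. It is not how any production model works: real large language models use neural networks trained on billions of documents, whereas this toy simply counts which word most often follows another. The corpus and the function names `train_bigrams` and `predict_next` are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count, for each word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = follows.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Hypothetical toy corpus; whatever goes in here is all the "model" can echo back.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]
model = train_bigrams(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" most often in this corpus
print(predict_next(model, "sat"))
```

The point of the toy is the dependency it exposes: the predictor can only reproduce patterns present in its training text, so whatever is scraped into the corpus shapes what comes out.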

These systems cannot tell fact from fiction, are fed vast amounts of data scraped from the internet, and can generate inaccurate results as well as regurgitate information. 

Companies that build large language models try to filter out unwanted content, in the training and inference stages, though their review processes are imperfect. What's also frustrating is that builders of commercial AI models - such as OpenAI's ChatGPT, Microsoft's new Bing, or Google's Bard chat - don't always disclose how they sourced, scrubbed, and processed their training data. 

Fortunately, the C4 dataset isn't as bad as others: it mostly contains material scraped from more benign websites spanning journalism, software development, medicine, and content creation. Most of its text comes from Google patents, Wikipedia, and Scribd. The New York Times and scientific journals from academic publisher PLOS ranked fourth and fifth respectively by volume in the dataset. C4 also features content from individuals' blogs, religious websites, and more. 
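For a rough idea of what ranking sources "by volume" can look like, the Python sketch below tallies whitespace-separated words per domain across a handful of (URL, text) records. This is an illustration only: the article does not describe the investigators' actual tooling, and the record layout, the word-counting rule, and the function name `rank_domains` are all assumptions.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_domains(records):
    """Rank domains by total word count across (url, text) records."""
    totals = Counter()
    for url, text in records:
        domain = urlparse(url).netloc.lower()
        # Treat "www.example.org" and "example.org" as the same site (an assumption).
        if domain.startswith("www."):
            domain = domain[4:]
        # Whitespace-separated words stand in for a real tokenizer here.
        totals[domain] += len(text.split())
    return totals.most_common()


# Hypothetical sample records; real scraped datasets hold vastly more documents.
records = [
    ("https://www.example.org/a", "one two three four"),
    ("https://example.org/b", "five six"),
    ("https://blog.example.net/post", "seven eight nine"),
]
for domain, words in rank_domains(records):
    print(domain, words)
```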

Copyrighted material is swept up in the dataset, too, with the © symbol appearing more than 200 million times. It's not clear whether companies building AI products based on training data containing protected works are liable for infringing intellectual property.

Stability AI, a startup building text-to-image tools, has been sued for scraping copyrighted images from stock photo platforms. OpenAI also faces a lawsuit challenging its collection of public code hosted on GitHub used to create Microsoft's AI-pair-programming Copilot tool.

Reddit just announced an update to its terms and conditions for its API services, requiring companies to pay for licenses to scrape its data. "We are introducing a new premium access point for third parties who require additional capabilities, higher usage limits, and broader usage rights," it stated on Tuesday.

C4 contains content from the internet up until 2019, but as other, more recent models were built with similar data collection practices, this research shines a light on how AI chatbots can produce problematic output.

The Register has asked the Allen Institute for AI for further comment. ®
