4chan and other web sewers scraped up into Google's mega-library for training ML
Are you still so keen to have generative AI write your emails, sales proposals, blog posts ... ?
Problematic, racist, and pornographic web content is seemingly being used to train Google's large language models, despite efforts to filter out that stratum of toxic and harmful text.
An investigation by The Washington Post and the Allen Institute for AI analyzed Google's immense public C4 dataset, released for academic research, to get a better understanding of what types of websites are typically scraped to train large language models.
The C4 dataset was used to train Google's T5 Text-to-Text Transfer Transformer as well as Facebook's Large Language Model Meta AI (LLaMA), a variant of which raised alarm bells.
It appears C4 has ingested concerning material, which is being used to build next-gen machine-learning systems. That could cause those systems to behave inappropriately and unreliably.
Regular Register readers will be aware we've pointed out problems with training datasets over and over, such as the horrible underbelly of a highly cited set curated by MIT.
The Post and Allen Institute's analysts ranked the top 10 million websites included in C4 by the volume of their text found in the dataset. Although C4 is a smaller, cleaner version of the Common Crawl dataset, which comprises text from billions of webpages, it still contained undesirable material from dark corners of the internet.
Racist, anti-trans, and otherwise toxic text was scraped from websites such as the race-hate haven Stormfront, the doxxing forum Kiwi Farms, and toxic message board 4chan. It's therefore unsurprising that language models based on that corpus can generate inappropriate content, talk of conspiracy theories, or bring up dubious ideologies.
C4 also includes websites hosting degrees of personal information, such as voter registration databases. Against this backdrop, regulatory agencies in Italy, Canada, Spain, and France have launched investigations into OpenAI's ChatGPT over data privacy concerns, since the model can ingest and generate sensitive information.
Large language models powering AI chatbots are neither intelligent nor conscious, no matter how magic they seem: they write by predicting the flow of words and sentences in response to prompts, questions, and instructions from users or even other bots. This involves drawing upon the mountains of data they've been trained on, and learning from it, to emulate what a person would write.
These predictions therefore reflect patterns in the kinds of text humanity produces, such as internet posts, news articles, poetry, and novels, that is all vacuumed up into vast training datasets.
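That prediction mechanism can be sketched with a toy bigram model, a deliberately simplified illustration (real LLMs use neural networks over tokens, not raw word counts), which makes the key point concrete: whatever patterns sit in the training text, good or toxic, are what the model reproduces.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus: str):
    """Count which word follows which in the training text."""
    words = corpus.lower().split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def predict_next(follows, word: str) -> str:
    """Return the word most often seen following `word` during training."""
    word = word.lower()
    if word not in follows:
        return "<unknown>"
    return follows[word].most_common(1)[0][0]

# The model simply echoes the statistics of its corpus:
model = train_bigrams("the cat sat on the mat and the cat slept")
print(predict_next(model, "the"))  # "cat" followed "the" twice, "mat" once
```

Scale the corpus up from one sentence to a scrape of the web, and the same principle applies: the output is only as clean as the data that went in.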
These systems cannot tell fact from fiction, are fed vast amounts of data scraped from the internet, and can generate inaccurate results as well as regurgitate information.
Companies that build large language models try to filter out unwanted content, at both the training and inference stages, though their review processes are imperfect. What's also frustrating is that builders of commercial AI models - such as OpenAI's ChatGPT, Microsoft's new Bing, or Google's Bard chatbot - don't always disclose how they sourced, scrubbed, and processed their training data.
- Reddit: If you want to slurp our API to train that LLM, you better pay for it, pal
- Predict stocks, foresee public opinion, all kinda possible with ChatGPT-like models
- Google crams more AI into search as Apple, Samsung sniff around Bing
- What if someone mixed The Sims with ChatGPT bots? It would look like this
Fortunately, the C4 dataset isn't as bad as others: it mostly contains material scraped from more benign websites spanning journalism, software development, medicine, and content creation. Most of its text comes from Google patents, Wikipedia, and Scribd. The New York Times and scientific journals from academic publisher PLOS ranked fourth and fifth respectively by volume in the dataset. C4 also features content from individuals' blogs, religious websites, and more.
Copyrighted material is swept up in the dataset, too, with the © symbol appearing more than 200 million times. It's not clear whether companies building AI products based on training data containing protected works are liable for infringing intellectual property.
Stability AI, a startup building text-to-image tools, has been sued for scraping copyrighted images from stock photo platforms. OpenAI also faces a lawsuit challenging its collection of public code hosted on GitHub, used to create Microsoft's AI pair-programming Copilot tool.
Reddit just announced an update to its terms and conditions for its API services, requiring companies to pay for licenses to scrape its data. "We are introducing a new premium access point for third parties who require additional capabilities, higher usage limits, and broader usage rights," it stated on Tuesday.
C4 contains content scraped from the internet up until 2019, but since more recent models were built with similar data collection practices, this research shines a light on how AI chatbots can come to produce problematic output.
The Register has asked the Allen Institute for AI for further comment. ®