MIT apologizes, permanently pulls offline huge dataset that taught AI systems to use racist, misogynistic slurs

Top uni takes action after El Reg highlights concerns raised by academics


Special report MIT has taken offline its highly cited dataset that trained AI systems prone to describing people using racist, misogynistic, and other problematic terms.

The database was removed this week after The Register alerted the American super-college. MIT also urged researchers and developers to stop using the training library, and to delete any copies. "We sincerely apologize," a professor told us.

The training set, built by the university, has been used to teach machine-learning models to automatically identify and list the people and objects depicted in still images. For example, if you show one of these systems a photo of a park, it might tell you about the children, adults, pets, picnic spreads, grass, and trees present in the snap. Thanks to MIT's cavalier approach when assembling its training set, though, these systems may also label women as whores or bitches, and Black and Asian people with derogatory language. The database also contained close-up pictures of female genitalia labeled with the C-word.

Applications, websites, and other products relying on neural networks trained using MIT's dataset may therefore end up using these terms when analyzing photographs and camera footage.

The problematic training library in question is 80 Million Tiny Images, which was created in 2008 to help produce advanced object-detection techniques. It is, essentially, a huge collection of photos with labels describing what's in the pics, all of which can be fed into neural networks to teach them to associate patterns in photos with the descriptive labels. So when a trained neural network is shown a bike, it can accurately predict a bike is present in the snap. It's called Tiny Images because the pictures in the library are small enough for the computer-vision algorithms of the late 2000s and early 2010s to digest.
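To make the mechanics concrete, here's a minimal sketch of that labels-plus-pixels training loop. To be clear, this is not MIT's code: it uses PyTorch and CIFAR-10 – a curated 60,000-picture subset that was itself drawn from Tiny Images – with a deliberately small network suited to 32x32 thumbnails.

# A minimal sketch (not MIT's code) of training on labelled 32x32 images.
# CIFAR-10 stands in for Tiny Images here; PyTorch and torchvision are
# assumed to be installed.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# A deliberately small convolutional net: 32x32 inputs need little capacity.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32x32 -> 16x16
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16x16 -> 8x8
    nn.Flatten(), nn.Linear(64 * 8 * 8, 10))  # 10 labels, e.g. "truck", "cat"

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:              # one epoch is enough to illustrate
    opt.zero_grad()
    loss = loss_fn(model(images), labels)  # ties pixel patterns to label text
    loss.backward()
    opt.step()

Whatever strings sit in the label column are exactly what the trained model learns to emit – which is why careless labels are so corrosive.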

Today, the Tiny Images dataset is used to benchmark computer-vision algorithms alongside the better-known ImageNet training collection. Unlike ImageNet, though, no one, until now, had scrutinized Tiny Images for problematic content.

Vinay Prabhu, chief scientist at UnifyID, a privacy startup in Silicon Valley, and Abeba Birhane, a PhD candidate at University College Dublin in Ireland, pored over the MIT database and discovered thousands of images labeled with racist slurs for Black and Asian people, and derogatory terms used to describe women. They revealed their findings in a paper [pre-print PDF] submitted to a computer-vision conference due to be held next year.

Graph showing the number of pictures in the MIT dataset labeled with selected problematic words ... Source: Prabhu and Birhane

The dataset holds more than 79,300,000 images, scraped from Google Images, arranged in 75,000-odd categories. A smaller version, with 2.2 million images, could be searched and perused online from the website of MIT’s Computer Science and Artificial Intelligence Lab (CSAIL). This visualization, along with the full downloadable database, was removed on Monday from the CSAIL website after El Reg alerted the dataset's creators to the work done by Prabhu and Birhane.

The key problem is that the dataset includes, for example, pictures of Black people and monkeys labeled with the N-word; women in bikinis, or holding their children, labeled whores; parts of the anatomy labeled with crude terms; and so on – needlessly linking everyday imagery to slurs and offensive language, and baking prejudice and bias into future AI models.

A screenshot of the 2.2m dataset visualization before it was taken offline this week. It shows some of the dataset's examples for the label 'whore', which we've pixelated for legal and decency reasons. The images ranged from a headshot photo of a woman and a mother holding her baby with Santa to porn actresses and a woman in a bikini ... Source: Prabhu and Birhane

Antonio Torralba, a professor of electrical engineering and computer science at CSAIL, said the lab wasn't aware these offensive images and labels were present within the dataset at all. “It is clear that we should have manually screened them,” he told The Register. “For this, we sincerely apologize. Indeed, we have taken the dataset offline so that the offending images and categories can be removed.”

In a statement on its website, however, CSAIL said the dataset would be permanently pulled offline because the images were too small to inspect and filter by hand. The lab also admitted it automatically obtained the images from the internet without checking whether any offensive pics or language were ingested into the library, and it urged people to delete their copies of the data:

It has been brought to our attention that the Tiny Images dataset contains some derogatory terms as categories and offensive images. This was a consequence of the automated data collection procedure that relied on nouns from WordNet. We are greatly concerned by this and apologize to those who may have been affected.

The dataset is too large (80 million images) and the images are so small (32 x 32 pixels) that it can be difficult for people to visually recognize its content. Therefore, manual inspection, even if feasible, will not guarantee that offensive images can be completely removed.

We therefore have decided to formally withdraw the dataset. It has been taken offline and it will not be put back online. We ask the community to refrain from using it in future and also delete any existing copies of the dataset that may have been downloaded.
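The statement's arithmetic holds up. A quick back-of-envelope in Python, using only the figures quoted above, shows why nobody ever eyeballed the collection:

# Rough illustration using the figures quoted above: 80M images, 32x32 RGB.
n_images = 79_300_000
bytes_per_image = 32 * 32 * 3               # one byte per channel per pixel
raw_size_gb = n_images * bytes_per_image / 1e9
print(f"raw pixel data: ~{raw_size_gb:.0f} GB")   # roughly 244 GB

# One pass of human review, at a generous one image per second:
years = n_images / (86_400 * 365)
print(f"manual inspection: ~{years:.1f} years of nonstop viewing")  # ~2.5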

Prof Torralba told us a little more about how the library was constructed: a huge list of words – including derogatory terms – was obtained, and code was written to search the web for images using those words and collect the lot. The result was a dataset of raw, unvetted internet material.

“The dataset contains 53,464 different nouns, directly copied over from WordNet," Prof Torralba said, referring to Princeton University's database of English words grouped into related sets. "These were then used to automatically download images of the corresponding noun from internet search engines at the time, using the available filters at the time, to collect the 80 million images.”
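In outline, that pipeline looks something like the sketch below. The noun enumeration uses NLTK's genuine interface to Princeton's WordNet; fetch_thumbnails() is a hypothetical stand-in for the 2008-era image search engines MIT actually queried.

# Illustrative reconstruction of the collection pipeline -- not MIT's code.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet")  # fetch Princeton's WordNet database via NLTK

def fetch_thumbnails(query, n=100):
    """Hypothetical stand-in for a 2008-era image search query; a real
    implementation would download results and shrink them to 32x32."""
    return []

# Every WordNet noun becomes a category -- slurs and all. Nothing in this
# process vets the word list or inspects what the search engines return.
nouns = sorted({lemma.name() for synset in wn.all_synsets(pos="n")
                for lemma in synset.lemmas()})
print(f"{len(nouns):,} candidate labels")  # a vast pool; MIT used 53,464

dataset = {noun: fetch_thumbnails(noun) for noun in nouns}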

WordNet was built in the mid-1980s at Princeton's Cognitive Science Laboratory under George Armitage Miller, one of the founders of cognitive psychology. “Miller was obsessed with the relationships between words,” Prabhu told us. “The database essentially maps how words are associated with one another.”

For example, the words cat and dog are more closely related than cat and umbrella. Unfortunately, some of the nouns in WordNet are racist slang and insults. Now, decades later, with academics and developers using the database as a convenient silo of English words, those terms haunt modern machine learning.
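You can poke at WordNet's notion of relatedness yourself via NLTK. Its path_similarity method scores a pair of word senses between 0 and 1 by their distance in WordNet's is-a hierarchy:

from nltk.corpus import wordnet as wn  # assumes nltk.download("wordnet") ran

cat, dog, umbrella = (wn.synset(s) for s in
                      ("cat.n.01", "dog.n.01", "umbrella.n.01"))

# Scores closer to 1 mean fewer hops apart in the hypernym (is-a) tree.
print(cat.path_similarity(dog))       # 0.2 -- both carnivorous mammals
print(cat.path_similarity(umbrella))  # far lower -- only distantly related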

“When you are building huge datasets, you need some sort of structure,” Birhane told El Reg. “That’s why WordNet is effective. It provides a way for computer-vision researchers to categorize and label their images. Why do that yourself when you could just use WordNet?”

WordNet may not be so harmful on its own, as a list of words, though when combined with images and AI algorithms, it can have upsetting consequences. “The very aim of that [WordNet] project was to map words that are close to each other," said Birhane. "But when you begin associating images with those words, you are putting a photograph of a real actual person and associating them with harmful words that perpetuate stereotypes.”

ImageNet has the same problems, too, as it was also annotated using WordNet. An experiment dubbed ImageNet Roulette allowed people to submit photos to a neural network trained on ImageNet that would describe the images using labels from the dataset. Unsurprisingly, people fed the system snaps that fascinated them the most: their selfies. Some were shocked when the software described them using racist and offensive labels.


The fraction of problematic images and labels in these giant datasets is small, and it's easy to brush them off as anomalies. Yet this material can lead to real harm if it's used to train machine-learning models deployed in the real world, Prabhu and Birhane argued.

“The absence of critical engagement with canonical datasets disproportionately negatively impacts women, racial and ethnic minorities, and vulnerable individuals and communities at the margins of society,” they wrote in their paper.

These groups are often not well represented in AI training datasets, which is why facial-recognition algorithms struggle to identify women and people with darker skin. Earlier this year, a Black man in Detroit was wrongfully arrested by cops after facial-recognition software mistook him for a suspected thief. It's also why a controversial AI algorithm that generates high-resolution images from low-resolution snaps turned a blurry photo of Barack Obama into someone more Caucasian than Black.

“People don’t think about how these models are going to be applied or what they could be used for," said Birhane. "They just think ‘oh, here’s this cool thing I can do’. But when you start thinking deeper, you will start to find all these insidious purposes and see how these harms manifest.”

Giant datasets like ImageNet and 80 Million Tiny Images are also often collected by scraping photos from Flickr or Google Images without people's explicit consent. Meanwhile, Facebook hired actors who agreed to have their faces used in a dataset designed to teach software to detect computer-generated fake images.

Prabhu and Birhane said the social network's approach was a good idea, though they noted academic studies are unlikely to have the funding to pay actors to star in training sets. “We acknowledge that there is no perfect solution to create an ideal dataset, but that doesn’t mean people shouldn’t try and create better ones,” they said.

The duo suggested blurring people’s faces in datasets focused on object recognition, carefully screening the images and labels to remove any offensive material, and even training systems using realistic synthetic data. “You don’t need to include racial slurs, pornographic images, or pictures of children," they said. "Doing good science and keeping ethical standards is not mutually exclusive.” ®
