How to democratize ML? More public data, says MLCommons
Foundation makes 30k hours of speech and 340k keywords in 50 languages available online
Unless you're an English speaker, and one with as neutral an American accent as possible, you've probably butted heads with a digital assistant that couldn't understand you. With any luck, a couple of open-source datasets from MLCommons could help future systems grok your voice.
The two datasets, which were made generally available in December, are the People's Speech Dataset (PSD), a 30,000-hour database of spontaneous English speech; and the Multilingual Spoken Words Corpus (MSWC), a dataset of some 340,000 keywords in 50 languages.
By making both datasets publicly available under CC-BY and CC-BY-SA licenses, MLCommons hopes to democratize machine learning – that is to say, make it available to everyone – and help push the industry toward data-centric AI.
David Kanter, executive director and founder of MLCommons, told Nvidia in a podcast this week that he sees data-centric AI as a conceptual pivot from "which model is the most accurate" to "what can we do with data to improve model accuracy." For that, Kanter said, the world needs lots of data.
Increasing understanding with the People's Speech
Spontaneous speech recognition is still challenging for AIs, and the PSD could help learning machines better understand colloquial speech, speech disorders and accents. Had a database like this existed earlier, said PSD project lead Daniel Galvez, "we'd likely be speaking to our digital assistants in a much less robotic way."
The 30,000 hours of speech in the People's Speech Dataset were culled from a total of 50,000 hours of publicly available speech pulled from the Internet Archive digital library, and the dataset has two unique qualities. First, it's entirely spontaneous speech, meaning it contains all the tics and imprecisions of the average conversation. Second, it all came with transcripts.
By using some CUDA-powered inference engine tricks, the team behind PSD was able to cut the labeling time for that massive dataset to just two days. The end result is a dataset that can help chatbots and other speech recognition programs better understand people whose voices differ from those of white, male speakers of American English.
Galvez said that speech disorders, neurological issues and accents are all poorly represented in datasets, and as a result, "[those types of speech] aren't well understood by commercial products."
Again, said Kanter, products like those fail because of a lack of data that includes diverse speakers.
A corpus to broaden the reach of digital assistants
The Multilingual Spoken Words Corpus is a different animal from the PSD. Instead of complete sentences, the Corpus consists of 340,000 keywords in 50 languages. "To our knowledge this is the only open-source spoken word dataset for 46 of these 50 languages," Kanter said.
Like chatbots, digital assistants are prone to bias inherited from their training datasets, which has kept them from catching on as quickly as they could have. Kanter predicts that digital assistants will be available worldwide "by mid-decade," and he sees the MSWC as a key base for making that happen.
"When you look at equivalent databases, it's Mandarin, English, Spanish, and then it falls off pretty quick," Kanter said.
Kanter said some MLCommons member companies have already tested the datasets. So far, he said, they're being used to de-noise audio and video recordings of crowded rooms and conferences, and to improve speech recognition.
Kanter said he hopes the datasets will soon be widely adopted and used alongside other public datasets that commonly serve as sources for ML and AI researchers. ®