
This article is more than 1 year old

Mozilla releases voice dataset and transcription engine

Baidu's Deep Speech with TensorFlow under the covers

Mozilla has revealed an open speech dataset and a TensorFlow-based transcription engine.

Mozilla floated "Project Common Voice" back in July 2017, when it called for volunteers to either submit samples of their speech or check transcriptions of others' utterances.

The project has since collected 500 hours of samples (in the longer term, Common Voice wants 10,000 hours), comprising 400,000 recordings made by 20,000 people.

The project's Michael Henretty wrote that “most of us only have access to fairly limited collection of voice data; an essential component for creating high-quality speech recognition engines”. Even limited non-free data sets cost “upwards of tens of thousands of dollars”.

Mozilla's Sean White wrote that the job of extending Common Voice beyond English will begin in the first half of 2018.

Common Voice is available for download here, and if developers need more open source speech datasets, Mozilla helpfully links four other sets it was able to identify: LibriSpeech, the TED-LIUM Corpus, VoxForge, and Tatoeba.

Mozilla also announced an associated transcription effort based on Baidu's Deep Speech speech recognition project. Mozilla's Deep Speech “uses Google's TensorFlow project to make the implementation easier”, and claims a 6.5 per cent error rate on the LibriSpeech test-clean dataset.
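For readers unfamiliar with how such a figure is derived, here is a minimal sketch, assuming the 6.5 per cent is a word error rate (the usual metric for LibriSpeech results): the word-level edit distance between the reference transcript and the engine's output, divided by the number of reference words. The function name and sample sentences are illustrative and not taken from Mozilla's code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper, not part of Deep Speech. Assumes both
    strings are already normalised (case, punctuation).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

On this measure, a 6.5 per cent rate would mean roughly 6.5 word-level mistakes (substitutions, insertions or deletions) per 100 reference words.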

Mozilla speech components

Mozilla Deep Speech offers pre-built Python and Node.js packages and a command line binary.

In this post at Mozilla Hacks, Reuben Morais described Deep Speech as “an end-to-end trainable, character-level, deep recurrent neural network (RNN) … It can be trained using supervised learning from scratch, without any external 'sources of intelligence', like a grapheme to phoneme converter or forced alignment on the input.”
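The quote does not spell out how a character-level network's per-frame output becomes text without forced alignment. One common answer, and the one described in Baidu's original Deep Speech paper, is connectionist temporal classification (CTC). The sketch below shows only the simplest (greedy) CTC decoding step, assuming the network has already picked one label per audio frame; the blank symbol and function name are hypothetical, and Mozilla's actual decoder may differ.

```python
from typing import Sequence

# Hypothetical symbol for "no character emitted in this frame".
BLANK = "_"


def greedy_ctc_collapse(frame_labels: Sequence[str], blank: str = BLANK) -> str:
    """Collapse per-frame labels into text: merge consecutive repeats,
    then drop blanks. A blank between two identical labels keeps a
    genuine double letter from being merged away."""
    chars = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            chars.append(label)
        previous = label
    return "".join(chars)


if __name__ == "__main__":
    frames = list("hh_e_ll_lo__")
    print(greedy_ctc_collapse(frames))  # prints "hello"
```

Because many different frame sequences collapse to the same string, training needs only audio paired with its transcript, not a frame-by-frame alignment.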

As Morais noted, with 120 million parameters in the Deep Speech model, the group needed one machine with four Titan X Pascal GPUs, and two more servers with eight of the GPUs each.

The result of all that work was that on a GPU-equipped MacBook Pro, Deep Speech can transcribe a little over three seconds of audio per second of processing. With just a CPU, transcribing one second of audio takes around 1.4 seconds. ®

 
