
Is that you, HAL? AI can now see secrets through lipreading – kinda

LipNet's got potential but also a loooong way to go

AI surveillance could be about to get a lot more advanced, as researchers move on from using neural networks for facial recognition to lipreading.

A paper from researchers at the University of Oxford, Google DeepMind and the Canadian Institute for Advanced Research is under review for ICLR 2017 (the International Conference on Learning Representations), an academic machine-learning conference. It describes a neural network called “LipNet.”

LipNet can decipher which words have been spoken by analyzing the “spatiotemporal visual features” of someone speaking on video, with 93.4 per cent accuracy – beating professional human lipreaders.

It’s the first model to go beyond simple word classification to sentence-level sequence prediction, the researchers claimed.

Lipreading is a difficult task, even for people with hearing loss, who score an average accuracy rate of 52.3 per cent.

“Machine lipreaders have enormous practical potential, with applications in improved hearing aids, silent dictation in public spaces, covert conversations, speech recognition in noisy environments, biometric identification, and silent-movie processing,” the paper said.

But for those afraid of CCTV cameras reading secret conversations: don’t throw away the funky pixel-distorting glasses that mask your identity just yet.

A closer look at the paper reveals that the impressive accuracy figure covers only a limited dataset: words strung together into sentences that often make no sense, as in the example in the video below.

[YouTube video]

The GRID corpus is a series of audio and video recordings of 34 speakers, each speaking 1,000 sentences. The sentences all follow a “simple grammar”: command(4) + color(4) + preposition(4) + letter(25) + digit(10) + adverb(4).

The number in brackets is the number of word choices in each category, giving 64,000 possible sentences. Many files in the GRID corpus were missing or corrupted, leaving 32,839 videos from 13 speakers.
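Multiplying the category sizes together is where the 64,000 figure comes from: 4 × 4 × 4 × 25 × 10 × 4 = 64,000. Here's a minimal Python sketch of the grammar; the word lists follow the GRID corpus's published vocabulary, but treat them as illustrative, since the article itself only gives the category counts.

```python
import random

# GRID's "simple grammar" as described above. Category sizes come from
# the paper; the concrete word lists are included here for illustration.
GRAMMAR = {
    "command":     ["bin", "lay", "place", "set"],        # 4
    "color":       ["blue", "green", "red", "white"],     # 4
    "preposition": ["at", "by", "in", "with"],            # 4
    "letter":      list("abcdefghijklmnopqrstuvxyz"),     # 25 (no 'w')
    "digit":       [str(d) for d in range(10)],           # 10
    "adverb":      ["again", "now", "please", "soon"],    # 4
}

# 4 * 4 * 4 * 25 * 10 * 4 = 64,000 possible sentences
total = 1
for words in GRAMMAR.values():
    total *= len(words)
print(total)  # 64000

# Sample a random GRID-style sentence, e.g. "place red at g 9 now"
print(" ".join(random.choice(words) for words in GRAMMAR.values()))
```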

LipNet needs a lot of training to reach such high accuracy. Of the total videos, roughly 88 per cent were used for training and 12 per cent for testing. The system focuses on the shapes the speaker’s mouth makes as he or she talks, breaking the video down into image frames.
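For illustration, that split looks something like the Python below. The shuffle-based split is an assumption on our part – the article doesn't say whether the paper splits clips randomly or holds out whole speakers.

```python
import random

# Hypothetical illustration of the roughly 88/12 train/test split
# described above, using index stand-ins for the GRID video clips.
videos = list(range(32_839))
random.Random(0).shuffle(videos)

cut = int(0.88 * len(videos))
train, test = videos[:cut], videos[cut:]
print(len(train), len(test))  # 28898 3941
```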

These frames are then fed into the neural network as input and passed through several layers that map the mouth movements to phonemes, working out the words and sentences phonetically.

LipNet mapping frames into phonemes and words (Image credit: Assael et al.)
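To make the frames-to-phonemes pipeline concrete, here is a minimal PyTorch sketch of that kind of model: spatiotemporal (3D) convolutions over the frame sequence, a recurrent layer across time, and per-frame scores over output tokens. The layer sizes and kernel shapes are illustrative assumptions rather than the paper's actual configuration, and real training would add a sequence-level loss (such as CTC) on top to turn per-frame scores into sentences.

```python
import torch
import torch.nn as nn

class LipReader(nn.Module):
    """Illustrative LipNet-style model, not the paper's exact architecture."""

    def __init__(self, n_tokens=28):  # e.g. 26 letters + space + blank
        super().__init__()
        self.conv = nn.Sequential(
            # input: (batch, channels=3, time, height, width)
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),  # pool space, keep time resolution
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
        )
        self.gru = nn.GRU(input_size=64, hidden_size=128,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(256, n_tokens)

    def forward(self, frames):         # (B, 3, T, H, W)
        x = self.conv(frames)          # (B, 64, T, H', W')
        x = x.mean(dim=(3, 4))         # average over space -> (B, 64, T)
        x = x.transpose(1, 2)          # (B, T, 64) for the GRU
        x, _ = self.gru(x)             # (B, T, 256)
        return self.out(x)             # per-frame token scores

# Shape check with 75 frames of 50x100 mouth crops
scores = LipReader()(torch.randn(1, 3, 75, 50, 100))
print(scores.shape)  # torch.Size([1, 75, 28])
```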

LipNet is still a long way from handling real, natural conversations between two people. The system will need far more training data to cope with different accents and languages.

But if you’re still worried about cameras deciphering your whispers, maybe wear a mask. ®

 
