Is that you, HAL? AI can now see secrets through lipreading – kinda
LipNet's got potential but also a loooong way to go
AI surveillance could be about to get a lot more advanced, as researchers move on from using neural networks for facial recognition to lipreading.
A paper submitted by researchers from the University of Oxford, Google DeepMind and the Canadian Institute for Advanced Research is under review for ICLR 2017 (the International Conference on Learning Representations), an academic conference for machine learning, and describes a neural network called “LipNet.”
LipNet can decipher what words have been spoken by analyzing the “spatiotemporal visual features” of someone speaking on video to 93.4 per cent accuracy – beating professional human lipreaders.
It’s the first model that works beyond simple word classification to use sentence-level sequence prediction, the researchers claimed.
Lipreading is a difficult task, even for people with hearing loss, who score an average accuracy rate of 52.3 per cent.
“Machine lipreaders have enormous practical potential, with applications in improved hearing aids, silent dictation in public spaces, covert conversations, speech recognition in noisy environments, biometric identification, and silent-movie processing,” the paper said.
But, for those afraid of CCTV cameras reading into secret conversations, don’t start throwing away the funky pixel-distorting glasses that can mask your identity yet.
A closer look at the paper reveals that the impressive accuracy rate only covers a limited dataset of words strung together into sentences that often make no sense, like in the example used in the video below.
LipNet was trained and tested on the GRID corpus, a series of audio and video recordings of 34 speakers who speak 1,000 sentences each. The sentences all follow this “simple grammar”: command(4) + color(4) + preposition(4) + letter(25) + digit(10) + adverb(4).
The number in brackets is the number of word choices for each category, giving 64,000 possible sentences that can be spoken. Many files were missing or corrupted from the GRID corpus, leaving 32,839 usable videos from 33 speakers.
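The 64,000 figure is simply the product of the word choices in each slot of the grammar. A quick sketch in Python shows the arithmetic; the dictionary and function names here are invented for illustration and do not come from the paper.

```python
from math import prod

# Word choices per slot in the GRID sentence grammar, as quoted in the article.
# GRID_SLOTS and count_sentences are illustrative names, not from the paper.
GRID_SLOTS = {
    "command": 4,
    "color": 4,
    "preposition": 4,
    "letter": 25,
    "digit": 10,
    "adverb": 4,
}


def count_sentences(slots):
    """Return the number of distinct sentences: the product of choices per slot."""
    return prod(slots.values())


total = count_sentences(GRID_SLOTS)
print(total)  # 64000
```

Because each slot is chosen independently, the counts multiply rather than add, which is why a vocabulary of only 51 words yields tens of thousands of sentences.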
LipNet needs a lot of training to reach such high accuracy. Of the total number of videos, roughly 88 per cent were used for training and 12 per cent for testing. The system focuses on the various shapes that the speaker’s mouth makes as he or she talks, and breaks the video down into image frames.
These are then fed into the neural network as input, and pass through several layers that map the mouth movements into phonemes, to work out the words and sentences phonetically.
LipNet mapping frames into phonemes and words (Photo credit: Assael et al)
LipNet is still a long way from handling real, normal conversations between two people. The system will require a ton more training data to deal with accents and different languages.
But if you’re still worried about cameras deciphering your whispers, maybe wear a mask. ®