Brit neural net pioneer just revolutionised speech recognition all over again
Deep learning with Dr Tony Robinson
Profile One of the pioneers of making what's called "machine learning" work in the real world is on the comeback trail.
At Cambridge University's Computer Science department in the 1990s, Dr Tony Robinson taught a generation of students who subsequently turned the Fens into the world's speech recognition centre. (Microsoft, Amazon and even secretive Apple have speech labs in Cambridge; Robinson's pioneering work lives on, having been tossed around Autonomy, HP and now MicroFocus.) With his latest venture, Speechmatics, Robinson finally wants to claim some of that success for himself.
A former student of Robinson's, techie and entrepreneur Matthew Karas, recalls him as "the most popular lecturer in that department by a long way; a lovely guy and a genuine scientific genius. He really is up there with the absolute greats."
What the Teessider achieved was to prove that neural networks could work for speech recognition.
"He did what the best speech scientists said was impossible," Karas recalls. "By 1994 he had a system in the top 10 in the world in the DARPA Continuous Speech Evaluation trial. The other nine systems were all hidden Markov models, and Tony's was the only neural network system. He proved it could get into top 10, which was a massive innovation."
With neural networks today tweaked and rebranded as "machine learning" and "deep learning" (which is how Robinson's Speechmatics brands its system), his legacy represents an important but largely unheralded British contribution to the modern world. Even more so since "deep learning" often sounds better in a research paper than it performs in reality.
How neural nets revolutionised speech recognition
Robinson himself explains:
The theory goes back to a lot of IBM work way before the 1980s. I could see the potential in the very late 1980s and early 1990s of neural nets and speech recognition. Recurrent neural nets had the wonderful advantage that they feed back on themselves. So much of how you say the next bit depends on how you said anything else. For example, I can tell you're a male speaker just from the sound of your voice. Hidden Markov models have this really weird assumption in them that all of that history didn't matter. The next sample could come from a male speaker or a female speaker, they lost all that consistency. It was the very first time that continuous recognition was done. We used some DSP chip we had lying around.
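For readers who want the flavour of the distinction Robinson is drawing, here is a minimal sketch, purely illustrative (the layer sizes and random weights are assumptions, not Robinson's recurrent network): a recurrent net carries a hidden state from one acoustic frame to the next, which is exactly the history a hidden Markov model's independence assumption throws away.

```python
# Minimal sketch (illustrative only): a recurrent net carries a hidden state
# from frame to frame, so long-range properties of the speaker (for instance
# a consistently male or female voice) can influence every prediction.
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden, n_phones = 13, 64, 40   # illustrative sizes
W_in  = rng.standard_normal((n_hidden, n_features)) * 0.1
W_rec = rng.standard_normal((n_hidden, n_hidden)) * 0.1
W_out = rng.standard_normal((n_phones, n_hidden)) * 0.1

def rnn_posteriors(frames):
    """Return per-frame phone probabilities; h carries the history."""
    h = np.zeros(n_hidden)
    outputs = []
    for x in frames:                        # one acoustic feature vector per frame
        h = np.tanh(W_in @ x + W_rec @ h)   # the net feeds back on itself
        logits = W_out @ h
        p = np.exp(logits - logits.max())
        outputs.append(p / p.sum())
    return np.array(outputs)

# An HMM, by contrast, scores each frame from the current state alone:
# p(frame_t | state_t) knows nothing about how the earlier frames sounded.
frames = rng.standard_normal((100, n_features))   # stand-in for a second of audio
print(rnn_posteriors(frames).shape)               # (100, 40)
```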
Karas, who recently won investment in his latest speech ventures, says: "Hidden Markov models work only if you know not just the phoneme context probabilities, but the word context probabilities. The list of viable three-word combinations is very very long.
"With neural networks, the system will assign a probability to something in context without having to know every possible context. It does it by trial and error. The others know the probability, because they have a list of all the hits and misses and divide one by the other. Now, almost all speech recognition systems use neural nets in some way."
But wider recognition has come late, and it's been a rocky road. At the height of the dot-com bubble, Robinson and Karas were each working for their own speech startups, with investment from Mike Lynch's Autonomy.
Autonomy invested in Robinson's first company, SoftSound, in May 2000. Karas in the meantime had set up BBC News Online, the skunkworks project that "saved the BBC", and then found an application for his speech recognition know-how in turning newsroom video into searchable text: Dremedia. But it was a third startup, Blinkx, that got Autonomy's attention; through its acquisitions Autonomy had built out a diversified set of businesses and lost interest in Robinson's work. He left when Autonomy acquired SoftSound outright in 2006.
Then for a while Robinson led the advanced speech group at SpinVox, which became notorious when the proportion of human transcription was revealed. Insiders told us that "no more than 2 per cent" of messages were actually machine transcribed, and SpinVox wanted Robinson to build a future system with a much higher level of automation. Within months the company was sold to Nuance. So it was back to the drawing board.
In recent years speech recognition from Amazon, Microsoft and Google has made phenomenal advances. What can Speechmatics boast? What is it and what does it do?
Language models falling short? Make a new one
Despite great advances from the US giants, there are still huge flaws. Six years after launch, Apple's Siri still can't cope with Scottish or Geordie accents. And adding new languages, while necessary to break into overseas markets, is a painstaking process.
Reflecting on his 90s work, Robinson had unfinished business. "We'd nearly made it to the tipping point of tipping everyone over. But we didn't. There was a period of slowdown in improvements, although it always got better year on year."
He went back to the drawing board. What emerged has caused ripples across the speech recognition community. Speechmatics showed a real-time, speaker-independent recognition system that could add new languages easily, but ran on an Android phone – or a server on your premises.
"If you read it you would not know it's incredible. It's a technology of such remarkable ingenuity. I was astonished when I heard it was possible," enthuses Karas.
To understand the breakthrough you need to grasp the importance of a probabilistic language model in speech recognition.
"A language model is a huge load of probability tables and data that match sound with word," Karas explains. "Neural networks help because it's quicker to get to a correlation without having to list every probability of every context. The context won't have to be enumerated (was that 'Tattoo' or 'Tattle?')."
But some contexts have a genuine lexical ambiguity, and no matter how much data you have, the machine will struggle. The two readings of 'row' (a quarrel, or propelling a boat) are a case in point: "even with the context, you couldn't know how to say 'row' in the sentence, 'John and Jim were rowing'. A phonetic equivalent might be that the system would need more than just close word context to distinguish between 'The doctor needs more patience' and 'The doctor needs more patients'," says Karas.
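As a rough illustration of why nearby words usually do the disambiguating, and why Karas's examples are the hard cases, here is a toy sketch. The bigram numbers and scores are invented for the example: two homophones get the same acoustic score, so only the language model's view of the surrounding words can separate them.

```python
# Toy sketch (invented numbers, not a real recogniser): homophones tie on the
# acoustic score, so the word-context probabilities have to break the tie.
import math

# Assumed toy bigram log-probabilities, e.g. estimated from newspaper text.
toy_lm = {
    ("more", "patients"): math.log(0.008),
    ("more", "patience"): math.log(0.002),
}
acoustic_score = math.log(0.5)   # same sound, same acoustic score for both

def total_score(prev_word, candidate):
    return acoustic_score + toy_lm[(prev_word, candidate)]

best = max(["patients", "patience"], key=lambda w: total_score("more", w))
print(best)   # "patients" wins only because the word context favours it;
              # when close context doesn't favour either reading, as with
              # 'John and Jim were rowing', the model is stuck.
```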
Another wrinkle is accents: people speak differently at different times. Tony Blair, for example, famously adopted Mockney on occasion.
So Robinson devised a new way of doing a language model.
New models used to take months because the process involved compiling pronunciation dictionaries and other canonical data sources, which were then applied to the training data from real speakers. That training data had to be marked up phonetically, a job that was only partially automated and required laborious manual correction.
"The new system uses an algorithm with an international phoneme set, which can work for a completely unknown language," says Karas. "Give it some Mongolian audio and Mongolian text and, just from the order and frequency of the characters in the text and the properties of the sound wave, it works out which time-slice of the audio matches which word in the text. You can process any transcribed or scripted source, with associated media, search the text, and the results will link directly to the right point in the audio within a few milliseconds."
"Since 2000 I've been with around half a dozen companies," Robinson said. "Small companies are always cash bound and all want to compete with the large companies. It was obvious with this background that if we wanted to compete with very large speech companies we had to produce more than 20 languages, roughly. So with my money how could we do this?"
Speechmatics put a live demo on its website, with a metered usage model. But, unsatisfied, Robinson dismantled the architecture and rebuilt it last year.