Rip it apart, cut the chaff, put it back together
"Without wanting to contribute too much to the hype around neural networks here, we had a system that was live on the Speechmatics site and the volume we were doing was going up and up. We knew we needed to get more efficient at it. We took the whole system apart, this time last year, and I set about a task with two bright guys I work with: take it all apart, work out what we need to do, stick it together in the most efficient order. Really question everything we need to do, every assumption. How much can we put in the neural networks? How much can we take away from the CPU-intensive part of it? Get rid of as much of it as we can.
"Between the three of us, we came up with a new architecture for doing speech recognition. It heavily relies on neural network acoustic models and language models. We brought the memory down, and the speed up, so it was good enough to go on the phone. We put it on without too much work. But it's using only one processor core."
Even against a noisy background, the demo on an Android was stunningly accurate.
"Last year we were putting languages out every two weeks. There's 27 right now and some more are coming. We're tackling the hardest languages in the world, like Icelandic. A year ago building a new language model was an overnight job, but we've got that down."
There are several reasons why companies would want to use a speech specialist like Speechmatics, rather than Google.
"It's fun. We have so many different people coming to us with so many different needs. Everyone has understandable concerns about who's using the data for what: you want to know it's not leaving the building. We do on-premises work, halfway between cloud and the embedded stuff," Robinson says.
"We can just say: here's a copy, it's the same thing running on Android but we know you're a bank, you cannot have data leave the building, so here's something you can install on the tin and it runs the speech recognition in exactly the same way as it does with the cloud. The cloud is in many ways a shop front for us."
Speechmatics' ability to transcribe and index huge volumes of speech quickly has been noticed in finance and legal circles.
"You need to be able to unwind a financial transaction, and explain that this is the sequence of things that led up to it. A recorded conversation by itself is no good to you. You need to make it searchable. We're just a little tool in their grand scheme of things."
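The "make it searchable" step can be sketched as a minimal inverted index over timestamped transcript segments. The segment data, function names, and timestamps here are invented for illustration; this is not Speechmatics' actual indexing pipeline:

```python
from collections import defaultdict

# Hypothetical transcript: (start_time_seconds, text) segments, roughly
# the shape a speech-to-text engine might emit.
segments = [
    (0.0, "we agreed to unwind the transaction"),
    (4.2, "the sequence of events led up to it"),
    (9.7, "confirm the transaction was recorded"),
]

def build_index(segments):
    """Map each lower-cased word to the start times of segments containing it."""
    index = defaultdict(list)
    for start, text in segments:
        for word in text.lower().split():
            index[word].append(start)
    return index

def search(index, word):
    """Return the start times where the word was spoken, in order."""
    return index.get(word.lower(), [])

index = build_index(segments)
print(search(index, "transaction"))
```

A compliance team could then jump straight to the moments a word was spoken, rather than replaying hours of audio.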
How is Speechmatics able to add new languages so quickly?
"It's the neural networks! We need some data for a particular language, but much less than you normally need, because we can pick up what we've done from other languages. How I make sounds with my mouth is quite similar to how a Japanese speaker does – you've got the same vocal apparatus. You're making the same sort of sounds.
"So a lot of what we've got to do first is go from the waveform – that acoustic data – to the phonemes, the basic sounds of the language. That isn't completely language-dependent. So we can have thousands of hours from one language and a smaller amount from another, and just tweak it a little bit."
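What Robinson describes is essentially transfer learning: pretrain on a data-rich language, then fine-tune those weights on a much smaller dataset for the new one. A toy sketch of the idea, using a softmax classifier as a stand-in "acoustic model" (the Gaussian-cluster data, class centres, and training loop are all invented for illustration, not Speechmatics' architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def make_phoneme_data(centres, n_per_class, shift=0.0):
    """Toy 'acoustic' features: one Gaussian cluster per phoneme class."""
    X = np.vstack([rng.normal(c, 0.5, size=(n_per_class, 2)) + shift
                   for c in centres])
    y = np.repeat(np.arange(len(centres)), n_per_class)
    bias = np.ones((len(X), 1))          # bias column
    return np.hstack([X, bias]), y

def train(X, y, W=None, epochs=100, lr=0.5):
    """Full-batch gradient descent on softmax cross-entropy."""
    k = int(y.max()) + 1
    W = np.zeros((X.shape[1], k)) if W is None else W.copy()
    Y = np.eye(k)[y]
    for _ in range(epochs):
        P = softmax(X @ W)
        W -= lr * X.T @ (P - Y) / len(X)
    return W

def xent(X, y, W):
    """Mean cross-entropy loss."""
    P = softmax(X @ W)
    return -np.log(P[np.arange(len(y)), y]).mean()

centres = [(0, 0), (3, 0), (0, 3)]       # three stand-in phoneme classes

# "Language A" has plenty of data; "language B" has the same sorts of
# sounds, slightly shifted, with far fewer examples.
Xa, ya = make_phoneme_data(centres, 300)
Xb, yb = make_phoneme_data(centres, 10, shift=0.3)

W_a = train(Xa, ya, epochs=300)              # pretrain on language A
W_transfer = train(Xb, yb, W=W_a, epochs=5)  # warm start, tiny budget
W_scratch = train(Xb, yb, epochs=5)          # cold start, same tiny budget

print(xent(Xb, yb, W_transfer), xent(Xb, yb, W_scratch))
```

With this setup the warm-started model should reach a lower loss on language B than training from scratch on the same tiny budget, which is the effect Robinson is describing: most of what was learned about mapping sound to phonemes carries over, and the new language only needs "a little bit" of tweaking.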
Lowering the cost has some unexpected benefits.
"Icelandic has only about 400,000 speakers, and they're worried that their language will die out. But it's a country with only one-twentieth the population of London. If it was expensive to do, we'd never do Icelandic."
And the near future, and long-term goals?
"How can we ensure as many people as possible can use it? One of the things I like about commercial research is that people actually use it. You can publish four-page papers on your work, and people just fall asleep.
"We have released the API to our cloud version and the API to the real-time embedded one is almost ready. There are business problems to sort out – like licensing – but we want to stay the most accurate."
Even if the "neural network hype bursts, we've got a solid base of users." ®