If you've been amazed by Amazon's Alexa, Microsoft's Cortana and Google Assistant, you might think continuous speech recognition is done and dusted – and that there are no mountains left to climb. However, a young British company has developed a radical new approach with spectacular results, based on low-level signal processing.
Unlike speech-to-text products, Eloqute analyses speech habits in real time. The result is an educational tool designed to improve an English* speaker's pronunciation – something with a huge and growing market as business travellers seek to impress their clients, and more call centres use non-native English speakers.
The software notches up several technical firsts: it gives the user real-time prioritised feedback as they speak, and it lets the speaker use any text they want, rather than stock phrases. Remarkably, it will perform this magic on a client device such as a phone.
"It's difficult and expensive for a non-native English speaker to improve their pronunciation beyond a certain point. Repetitive home learning doesn't work very well, and beyond that, their only option is expensive private tuition," Speech Engineering Ltd's (SEL) Matthew Karas told us.
It's demoralising to be asked to repeat the phrases Linguaphone gives you, he said, so many who start using such software give up. Eloqute spots the most striking pronunciation errors first, and then, via simple targeted advice, prioritises the skills which have most impact on intelligibility. No other software, Karas said, focuses on identifying habits rather than individual errors.
Eloqute is the first commercial product from SEL. Karas and his colleague Josh Greifer both have storied backgrounds. Karas built the world's first industrial-strength CMS for the BBC News skunkworks – a story we have told before – then founded a speech-recognition startup, sold to Mike Lynch's Autonomy in 2003.
After a period as a games programmer in the '80s, Greifer went to work for Charlie Steinberg, writing the audio parts of Cubase. That might not seem relevant right away, but it is: some major technical breakthroughs came when Eloqute's creators started to work where computational linguists traditionally fear to tread – down in the waveform.
"Language technologists can be scared of low-level, real-time signal processing, so they usually get the OS to handle it," Karas explained. "To get something like Cubase to work, you have to guarantee very low latency – musicians need to hear what they're playing soon enough for it to feel instantaneous, while remaining in sync with the backing, and applying effects, mix automation etc."
Compute - scale by smartphone
Greifer's familiarity with complex low-latency processes turned out to be important when they realised the cost of server-based delivery.
"Streaming speech from 300 million learners to the cloud does not scale nicely."
So they started work on a platform which can optimise any combination of speech analysis algorithms on the phone, sometimes achieving 100-fold improvements on legacy techniques. The implications of SEL's underlying platform reach far beyond computational linguistics.
"We can switch between different configurations of complex processes a hundred times a second. We did this to scale our language app, but now that we have the platform, it could be used for things like on-phone video processing – or even speech recognition."
In plain English, SEL is using the immense and untapped processing power of client devices such as phones to do more, and Eloqute is just the first example. Today's phones have eight or 12 cores sitting idle most of the time. What's exciting is that by applying that power selectively every few milliseconds, a humble phone can perform better than a company with a vast investment in server farms: an Amazon, Facebook or Google.
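To give a feel for what switching configurations "a hundred times a second" might mean in practice, here is a toy Python sketch. It is not SEL's method, which the company has not published. The 16 kHz sample rate, the energy threshold, the configuration names and the helper functions are all assumptions made purely for illustration. The only figure taken from the article is the rate of 100 decisions per second, which works out to one decision per 10 ms frame of audio.

```python
import math

SAMPLE_RATE = 16_000          # assumed sample rate in Hz (not from the article)
SWITCHES_PER_SECOND = 100     # "a hundred times a second", from Karas's quote
FRAME_LEN = SAMPLE_RATE // SWITCHES_PER_SECOND  # 160 samples, i.e. 10 ms


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def choose_config(frame, threshold=0.01):
    """Hypothetical policy: skip near-silent frames, analyse the rest in detail."""
    return "detailed" if frame_energy(frame) >= threshold else "skip"


def plan(samples):
    """Return one configuration choice per complete 10 ms frame."""
    return [
        choose_config(samples[i:i + FRAME_LEN])
        for i in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN)
    ]


# Half a second of silence followed by half a second of a 440 Hz tone.
silence = [0.0] * (SAMPLE_RATE // 2)
tone = [
    0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE // 2)
]
choices = plan(silence + tone)
```

In this toy version the expensive analysis is only scheduled for frames that contain sound, which is the general idea of applying processing power selectively rather than running every algorithm on every frame.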
But for Karas, the most appealing feature is freeing the learner from soul-destroying rote learning.
"Eloqute will help you if you're rehearsing a conference speech, or reading a bedtime story to your kids. This has more benefits than staying motivated: learners don't expose their bad habits when being tested on a fragment of speech, and they don't form new good habits by parroting phrases."
SEL is launching Eloqute via traditional teachers first: "We are talking to large classroom-based operators, like Education First and Apollo English in Vietnam. They are serious about getting results because they come face-to-face with students, and they really know how to analyse learning outcomes." ®
* The product supports English for now, but should be able to adapt to other language models fairly easily. "The tech is completely language neutral - even tonal languages like Chinese and Vietnamese would be possible," Karas told us. "However the market for English is bigger than all the others put together. We will probably use tone and rhythm to teach better English pronunciation in future, before we'd ever get onto other languages."