In a machine learning tug-of-war, Microsoft may have just barely slipped ahead of IBM for speech transcription accuracy.
Researchers are studying how to recognise human speech in a variety of settings – from realtime interactions to offline, pre-recorded voicemails. Boffins tell us that one application, particularly of offline transcription, could be government surveillance.
In March, IBM researchers claimed that they had achieved a word recognition error rate of 5.5 per cent for pre-recorded English telephone conversations between strangers on set topics such as sports. They're presenting their peer-reviewed research this week at the INTERSPEECH 2017 conference in Stockholm, Sweden.
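For readers wondering what a "word error rate" actually measures: it is conventionally the number of word substitutions, deletions and insertions needed to turn the machine's transcript into the human reference, divided by the number of words in the reference. Below is a minimal Python sketch of that calculation. The function name `word_error_rate` is our own, and the sketch skips the text normalisation (casing, punctuation, hesitations) that real benchmark scoring applies, so treat it as an illustration rather than the scoring used by either team.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: splits on whitespace and compares words exactly,
    with no normalisation of case or punctuation.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous_row[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far (initially none).
    previous_row = list(range(len(hyp_words) + 1))

    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]  # i deletions if the hypothesis is empty
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(
                min(
                    previous_row[j] + 1,  # reference word missing (deletion)
                    current_row[j - 1] + 1,  # extra hypothesis word (insertion)
                    previous_row[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat on mat"
    # One dropped word out of six reference words.
    print(f"{word_error_rate(reference, hypothesis):.1%}")
```

On that measure, an error rate of 5.5 per cent means roughly one word in 18 is wrong, missing or spurious relative to the reference transcript.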
Like the IBM work, Microsoft's algorithms used deep learning architectures for acoustic and language modelling. Microsoft says it achieved a word error rate of 5.9 per cent last year and credits the improvement to "using the most scalable deep learning software available, Microsoft Cognitive Toolkit 2.1 (CNTK), for exploring model architectures and optimizing the hyper-parameters of our models. Additionally, Microsoft's investment in cloud compute infrastructure, specifically Azure GPUs, helped to improve the effectiveness and speed by which we could train our models and test new ideas."
Eric Postma, a computer scientist at Tilburg University in the Netherlands who studies speech recognition, told The Register it is "a significant step forward" but "not a breakthrough" because the goal is to achieve human-level recognition – such as comprehending utterances when multiple voices speak simultaneously at a cocktail party, or when understanding requires common sense.
Microsoft admitted there's still tons of work to be done on recognising various accents, speaking styles and languages – not to mention comprehending conversations in crowded rooms with a distant mic.
And although IBM may claim that a 5.1 per cent error rate on this dataset would be human-level recognition, Postma said: "That's marketing, not science."
Phil Woodland, an information engineer at Cambridge uni who specialises in speech recognition and has worked on the same dataset before, told The Reg that "the error rates have come down significantly" since this problem was tackled in the early 1990s (on one 2004 telephone conversation dataset called RT-04, IBM researchers achieved an error rate of 15.2 per cent).
He pointed out that in addition to recognising speech between strangers, IBM's new paper also transcribed a dataset of speech between family members, who would speak casually (achieving an error rate of 10.3 per cent). By comparison, Microsoft's paper only tackled the "easier" problem: when strangers speak, their speech is more formal and easier to understand.
He said it's difficult to "pin down" a metric for human performance since it can vary from task to task. There's a chance the Microsoft algorithms might actually perform worse on the harder dataset or get similar numbers to IBM, he added.
It's also unclear if the Microsoft algorithms could apply to other datasets. It's possible that the researchers' algorithms might be tuned to work specifically on telephone conversations, and would not transfer to tasks such as voice search or transcribing broadcast data from media archives. ®