Microsoft on Tuesday said that its researchers have "made a major breakthrough in speech recognition."
In a paper [PDF] published a day earlier, Microsoft machine learning researchers describe how they developed an automated system that can recognize recorded speech as well as a professional transcriptionist.
Using the NIST 2000 dataset of recorded calls, Microsoft's software posted an error rate slightly (0.4 per cent) better than the one the company attributes to professional transcriptionists (5.9 per cent) for the Switchboard portion of the data, in which strangers discuss a specified topic.
It saw a similarly narrow margin of success with the CallHome portion of the data – in which family members converse without guidelines – where the human transcription error rate was 11.3 per cent.
A month ago, Microsoft's researchers reported that their software had achieved a 6.3 per cent word error rate. In May 2015, Google said it had achieved an 8 per cent error rate with its speech recognition technology. Such rapid progress underscores the intense interest in machine learning and artificial intelligence at technology companies.
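For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that conventional definition only; it is not Microsoft's or NIST's scoring code, the function name word_error_rate and the sample sentences are invented for illustration, and real evaluations also normalize the text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: splits on whitespace and does no text
    normalization, unlike formal scoring tools.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Invented example: one substitution ("the" -> "a") and one deletion ("now")
# against a six-word reference, so 2 errors / 6 words.
rate = word_error_rate("please call the local police now",
                       "please call a local police")
print(f"{rate:.1%}")
```

On this definition, a 5.9 per cent error rate means roughly six wrong, missing or extra words for every 100 words in the reference transcript.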
"This marks the first time that human parity has been reported for conversational speech," the researchers said in their paper, attributing their success to the use of convolutional and LSTM (long short term memory) neural networks, and to techniques such as spatial smoothing that improve the accuracy of data models. They also said that they relied on Microsoft's Computational Network Toolkit (CNTK), a machine learning framework the company has made available as an open source project.
Geoffrey Zweig, manager of Microsoft's speech and dialog research group, hailed the achievement as the culmination of over 20 years of effort.
To get there, Microsoft moved the goalposts a bit. The company's researchers dispensed with a 4 per cent error rate cited in a 1997 paper [PDF] for spontaneous conversations over a telephone line. That error rate estimate, they said, "is attributed to a 'personal communication,' and the actual source of this number is ephemeral."
When human transcribers evaluated the same audio files as Microsoft's software, their error rates were 5.9 per cent for the Switchboard portion and 11.3 per cent for the CallHome portion. Hence, the researchers deemed it inappropriate to use a single, anecdotal figure as the number to beat.
Microsoft expects its speech recognition advance will help improve its Cortana personal assistant software, among other products. And it emphasizes that achieving parity with human transcriptionists shouldn't be confused with perfection, because humans make mistakes too.
Cortana evidently can benefit from further improvement. Last month, security firm Sophos advised against relying on Cortana for making emergency calls, based on an account of a UK woman who used the software to dial the local police in order to report an accident and was directed to authorities in the US.
In the future, those in need of aid might consider calling out to idle transcriptionists. ®