Google's AI finds its voice ... and it's surprisingly human
Talk like a robot? I'm sorry Dave, I'm afraid I can't do that
Google has figured out how to use artificial intelligence to make machine speech sound more human, according to a new paper.
Using its “WaveNet” model, DeepMind, Google’s UK-based AI company, claims to have produced natural-sounding machine speech that halves “the gap with human performance.”
Machine babble often sounds emotionally flat and robotic because it’s difficult to capture the natural nuances of human speech.
Many systems are still based on a method called “concatenative text-to-speech” (TTS), which sounds out words by stringing together a large collection of phonetic sounds.
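In rough code, the concatenative idea looks something like the toy sketch below: pre-recorded snippets are looked up and stitched together end to end. The unit inventory and the “recordings” here are noise stand-ins invented for illustration, not anything Google ships.

```python
import numpy as np

# Toy sketch of concatenative TTS: pre-recorded phonetic snippets are looked
# up and stitched together end to end. The "recordings" are just noise
# stand-ins; a real system stores thousands of fragments of human speech.
SAMPLE_RATE = 16_000

def fake_recording(duration_s=0.1):
    # stand-in for a stored recording of one phonetic unit
    return np.random.uniform(-1.0, 1.0, int(SAMPLE_RATE * duration_s))

UNIT_DATABASE = {phone: fake_recording() for phone in ["HH", "AH", "L", "OW"]}

def concatenative_tts(phones):
    # "sound out" a word by stringing the stored units together
    return np.concatenate([UNIT_DATABASE[p] for p in phones])

audio = concatenative_tts(["HH", "AH", "L", "OW"])  # a crude rendering of "hello"
```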
Researchers have managed to improve speech synthesis somewhat by using “parametric text-to-speech,” which generates audio with a vocoder – a synthesiser that turns a set of parameters describing the speech into sound.
Although parametric TTS is more complex than concatenative TTS, the resulting speech can sound even less natural for syllabic languages like English, according to Google.
Like many existing AI projects, DeepMind has turned to neural networks – systems loosely modelled on how the human brain works – which churn through huge amounts of data to learn a specific task.
WaveNet directly models the “raw waveform of the audio signal, one sample at a time.” It demands a lot of computing power, because raw audio runs to about 16,000 samples per second and the model has to capture structure at many time scales.
The neural network is trained on recordings of human speech. As audio is generated, each sample’s value is “drawn from a probability distribution computed by the network.” That value is then fed back into the input, and the system predicts the next sample, step by step. Building up these samples from a wider range of human voices makes the result more realistic.
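That sample-by-sample loop can be sketched roughly as below. This is an illustration of the general idea, not DeepMind’s code: the stand-in “network” just returns random probabilities, where the real WaveNet is a deep network trained on speech recordings.

```python
import numpy as np

QUANT_LEVELS = 256      # the WaveNet paper quantises each sample into 256 levels
SAMPLE_RATE = 16_000    # roughly 16,000 samples per second of raw audio

def next_sample_distribution(history):
    """Stand-in for the trained network: given all previous samples, return a
    probability distribution over the next sample's value. Random here; in
    WaveNet it is computed by a deep network trained on human speech."""
    logits = np.random.randn(QUANT_LEVELS)
    return np.exp(logits) / np.exp(logits).sum()   # softmax

def generate(num_samples):
    samples = []
    for _ in range(num_samples):
        probs = next_sample_distribution(samples)         # distribution computed by the "network"
        value = np.random.choice(QUANT_LEVELS, p=probs)   # draw one value from it
        samples.append(int(value))                        # feed it back in as input
    return np.array(samples)

audio = generate(SAMPLE_RATE)  # one second of audio, built one sample at a time
```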
Google DeepMind claims that using WaveNet for US English and Mandarin Chinese has closed the gap between human and AI speech by 50 per cent, according to mean opinion scores gathered from subjective listening tests.
After training, to turn text into speech, WaveNet first has to process the text: the words are converted into a “sequence of linguistic and phonetic features” that is fed into the neural network.
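For context, a mean opinion score is simply the average of listeners’ subjective ratings, usually on a one-to-five naturalness scale; the numbers below are invented purely to show the arithmetic.

```python
# A mean opinion score (MOS) is the average of listeners' subjective ratings,
# typically on a 1-5 naturalness scale. These ratings are made up for illustration.
ratings = [4, 5, 4, 3, 5, 4]
mos = sum(ratings) / len(ratings)
print(f"MOS: {mos:.2f}")  # -> MOS: 4.17
```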
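A toy sketch of that text-processing step might look like the following; the tiny lexicon and feature names are made up for illustration and bear no relation to DeepMind’s actual front-end.

```python
# Toy text front-end: words are broken down into a sequence of phonetic
# features that would condition the network's sample-by-sample predictions.
# The lexicon and feature names are invented for illustration only.
TOY_LEXICON = {"hello": ["HH", "AH", "L", "OW"], "dave": ["D", "EY", "V"]}

def text_to_features(text):
    features = []
    for position, word in enumerate(text.lower().split()):
        for phone in TOY_LEXICON.get(word, ["UNK"]):
            features.append({"phone": phone, "word_position": position})
    return features

print(text_to_features("Hello Dave"))
# [{'phone': 'HH', 'word_position': 0}, ..., {'phone': 'V', 'word_position': 1}]
```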
Training WaveNets without text results in gibberish. The AI system has to make it up as it goes along, stringing together human-sounding noises that don’t make any sense.
DeepMind’s blog post doesn’t say how this technology will be applied, but the team has also used it to produce AI-made piano music. You can hear how the AI speaks and makes music here. ®