Google's DeepMind Develops an Entirely New Approach to TTS
Google bought U.K. artificial intelligence start-up DeepMind in 2014 for $532 million, and the lab has now unveiled technology that makes computer-generated speech sound more natural than Google's own existing text-to-speech systems.
The technology, called WaveNet, uses artificial intelligence and neural networks to mimic human speech by learning how to form the individual sound waves that the human voice creates.
In a blog post published late last week, DeepMind explained how the technology works: "The input sequences are real waveforms recorded from human speakers. After training, we can sample the network to generate synthetic utterances."
"A single WaveNet is able to learn the characteristics of many different voices, male and female. To make sure it knew which voice to use for any given utterance, we conditioned the network on the identity of the speaker," the blog post continued.
Essentially, WaveNet directly models the raw waveform of the audio signal, one sample at a time, to produce speech that the company says sounds more natural. In listening tests using English and Mandarin Chinese speech, the system outperformed Google's existing speech synthesis systems, reducing the gap with human speech by more than 50 percent.
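To give a rough sense of what "modeling raw waveforms one sample at a time" means, the sketch below shows how a one-second clip of 16 kHz audio can be turned into a sequence of 16,000 quantized amplitude values, each of which the network then has to predict from the values before it. The 256-level mu-law quantization and all names here are illustrative simplifications, not DeepMind's published code.

```python
import numpy as np

# Illustrative sketch only: each audio sample is mapped to one of a small
# number of amplitude classes, so "predict the next sample" becomes a
# classification problem. The network itself is not shown here.

MU = 255  # 8-bit mu-law companding -> 256 possible classes per sample

def mu_law_encode(audio, mu=MU):
    """Map floating-point samples in [-1, 1] to integer classes 0..mu."""
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu).astype(np.int64)

def mu_law_decode(classes, mu=MU):
    """Invert the encoding: integer classes back to samples in [-1, 1]."""
    compressed = 2 * classes.astype(np.float64) / mu - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu

# A one-second 16 kHz waveform becomes 16,000 class labels; the training
# task is "given labels 0..t-1, predict label t".
waveform = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in audio
targets = mu_law_encode(waveform)
print(targets.shape, targets.min(), targets.max())  # (16000,) within 0..255
```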
Current speech synthesis, or text-to-speech (TTS), technologies rely on a very large database of short speech fragments that are recorded from a single speaker and then recombined to form complete utterances. This concatenative process makes it difficult to modify the voice (for example switching to a different speaker, or altering the emphasis or emotion of their speech) without recording a whole new database, the company said in its blog post.
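The concatenative approach described above can be illustrated with a minimal sketch: stored fragments are looked up and spliced end to end. The unit names, the tiny database, and the crossfade length below are all made-up assumptions standing in for a real unit-selection system.

```python
import numpy as np

# Minimal sketch of concatenative TTS: look up pre-recorded fragments and
# splice them together to form an utterance.

SAMPLE_RATE = 16000
unit_db = {                       # a real system holds thousands of recorded clips
    "h-e": np.random.randn(800),  # stand-ins for recorded speech fragments
    "e-l": np.random.randn(900),
    "l-o": np.random.randn(700),
}

def concatenate_units(unit_names, crossfade=80):
    """Splice stored fragments end to end with a short linear crossfade."""
    out = unit_db[unit_names[0]].copy()
    for name in unit_names[1:]:
        nxt = unit_db[name]
        fade = np.linspace(1.0, 0.0, crossfade)
        out[-crossfade:] = out[-crossfade:] * fade + nxt[:crossfade] * fade[::-1]
        out = np.concatenate([out, nxt[crossfade:]])
    return out

utterance = concatenate_units(["h-e", "e-l", "l-o"])
# Changing the voice means re-recording every unit in unit_db -- exactly the
# limitation DeepMind's post points out.
```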
Another technique, parametric TTS, generates speech from information stored in the parameters of a model, so the contents and characteristics of the speech can be controlled via the model's inputs. "So far, however, parametric TTS has tended to sound less natural than concatenative," the company said.
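As a deliberately simplified picture of the parametric approach, the sketch below generates audio entirely from control parameters (a pitch contour and a loudness contour driving a plain oscillator). Real parametric systems use much richer acoustic models and vocoders; the parameter names here are illustrative assumptions only.

```python
import numpy as np

# Simplified sketch of parametric synthesis: no stored recordings, just
# audio rendered from input parameters.

SAMPLE_RATE = 16000

def synthesize(pitch_hz, loudness, duration_s=1.0):
    """Render audio by interpolating per-frame parameters to per-sample tracks."""
    n = int(SAMPLE_RATE * duration_s)
    frames = np.linspace(0, 1, len(pitch_hz))
    t = np.linspace(0, 1, n)
    f0 = np.interp(t, frames, pitch_hz)    # per-sample pitch track
    amp = np.interp(t, frames, loudness)   # per-sample loudness track
    phase = 2 * np.pi * np.cumsum(f0) / SAMPLE_RATE
    return amp * np.sin(phase)

# Because the output is driven entirely by inputs like these, the speech can
# be steered toward a different speaker or style by changing the parameters,
# but the result tends to sound less natural than recorded fragments.
audio = synthesize(pitch_hz=[120, 180, 140, 110], loudness=[0.2, 0.8, 0.6, 0.1])
```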
To model waveforms, DeepMind's WaveNet generates audio at 16,000 samples per second. For each sample, it has to predict what the sound wave should look like based on everything it has generated so far, which the company says requires a lot of computational power.
"Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio," the company explained.
The blog post says WaveNet can generate breathing and mouth movements and can learn the characteristics of many different voices, male and female. The same technology can even be used to synthesize other audio signals, such as music.
The company added that feeding additional inputs into its AI model, such as emotions or accents, could make the speech "even more diverse and interesting."
But the real challenge, it admits, will be to reduce the cost and computational power requirements before the product can become commercially viable.