Face of a Robot, Voice of an Angel?
Robotic speech can still get much, much better. Enter Google’s machine learning division, DeepMind, which has just announced a new text-to-speech technology called WaveNet. Unlike many previous approaches, the system uses convolutional neural networks to build an entire raw audio waveform from the ground up. It is trained on real audio recordings; when given text, it works through a sentence step by step, generating the audio one sample at a time, with each new sample predicted from the ones before it. DeepMind claims that the results sound more natural than the best existing systems, but you can judge for yourself by listening to some examples. There’s one hitch: the technique needs to produce at least 16,000 samples for every second of audio it creates, which makes it incredibly computationally expensive. According to the Financial Times, that means it won’t, for now, be used within any of Google’s products.
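To get a feel for why that stepwise approach is so costly, here is a minimal Python sketch of generating audio one value at a time. It is not DeepMind’s code: `toy_predict_next` is a hypothetical stand-in for the neural network (it simply continues a 440 Hz sine wave), and `generate` and `model_calls_needed` are names invented for this illustration. The only figure taken from the article is the 16,000-per-second rate.

```python
import math

SAMPLE_RATE = 16_000  # values per second of audio, the figure quoted above


def toy_predict_next(history):
    """Hypothetical stand-in for the neural network (an assumption, not WaveNet).

    A real model would predict the next value from the audio so far;
    this toy version only uses the position to continue a 440 Hz sine wave.
    """
    n = len(history)
    return math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)


def generate(seconds, predict_next=toy_predict_next, sample_rate=SAMPLE_RATE):
    """Build a waveform step by step; each step is handed all earlier values."""
    waveform = []
    for _ in range(int(seconds * sample_rate)):
        waveform.append(predict_next(waveform))
    return waveform


def model_calls_needed(seconds, sample_rate=SAMPLE_RATE):
    """Number of sequential predictor calls needed for a clip of this length."""
    return int(seconds * sample_rate)


if __name__ == "__main__":
    clip = generate(0.5)
    print(len(clip), "values generated for half a second of audio")
    print(model_calls_needed(10), "predictor calls for a ten-second sentence")
```

The point of the sketch is the loop: every value has to wait for the ones before it, so a ten-second sentence means 160,000 predictor calls in sequence. With a large neural network in place of the toy function, that adds up quickly.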