- Aäron van den Oord
- Sander Dieleman
- Heiga Zen
- Karen Simonyan
- Oriol Vinyals
- Alexander Graves
- Nal Kalchbrenner
- Andrew Senior
- Koray Kavukcuoglu
Abstract
This paper introduces WaveNet, a deep generative neural network trained end-to-end to model raw audio waveforms, with applications to text-to-speech and music generation. Current approaches to text-to-speech fall into two categories: non-parametric, example-based generation (which stitches together short audio segments from a large training set), and parametric, model-based generation (in which a model produces acoustic features that a vocoder then synthesizes into a waveform). In contrast, we show that directly generating wideband audio signals at tens of thousands of samples per second is not only feasible, but also yields results that significantly outperform the prior art. A single trained WaveNet can generate different voices by conditioning on the speaker identity. We also show that the same approach can be applied to music audio generation and speech recognition.
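The paper does not include code, but the core idea the abstract describes is an autoregressive model over quantized raw audio samples, built from stacked dilated causal convolutions so that the predicted distribution for each sample depends only on previous samples. The following is a minimal sketch of that idea in PyTorch; the class names, layer sizes, and dilation schedule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a WaveNet-style autoregressive audio model.
# NOT the authors' code: names, sizes, and the simplified residual
# block (plain tanh instead of gated units) are illustrative only.
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """1-D convolution made causal: output[t] depends only on input[<= t]."""
    def __init__(self, channels, kernel_size, dilation):
        # Pad symmetrically, then trim the right side so no future samples leak in.
        super().__init__(channels, channels, kernel_size,
                         padding=(kernel_size - 1) * dilation,
                         dilation=dilation)

    def forward(self, x):
        return super().forward(x)[..., :x.shape[-1]]

class TinyWaveNet(nn.Module):
    """Predicts a categorical distribution over the next 8-bit audio sample."""
    def __init__(self, channels=32, n_classes=256, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.embed = nn.Conv1d(n_classes, channels, 1)   # one-hot samples -> features
        self.layers = nn.ModuleList(
            [CausalConv1d(channels, kernel_size=2, dilation=d) for d in dilations])
        self.head = nn.Conv1d(channels, n_classes, 1)    # logits over 256 levels

    def forward(self, one_hot):                          # one_hot: (B, 256, T)
        h = self.embed(one_hot)
        for layer in self.layers:
            h = h + torch.tanh(layer(h))                 # simplified residual block
        return self.head(h)                              # (B, 256, T) logits

# Training maximizes the likelihood of each sample given all previous ones:
x = torch.randint(0, 256, (1, 16000))                    # one second at 16 kHz
one_hot = torch.nn.functional.one_hot(x, 256).float().transpose(1, 2)
logits = TinyWaveNet()(one_hot)
loss = nn.functional.cross_entropy(logits[..., :-1], x[..., 1:])
```

Generation runs sequentially, sampling one value at a time from the predicted distribution and feeding it back as input. The actual WaveNet differs in several ways described in the paper: it uses gated activation units and skip connections rather than the plain tanh residual above, mu-law companding for the 256-level quantization, much deeper dilation stacks, and extra conditioning inputs (such as a learned speaker embedding) to realize the multi-speaker behavior the abstract mentions.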