WaveNet: A Generative Model for Raw Audio

Aäron van den Oord; Sander Dieleman; Heiga Zen; Karen Simonyan; Oriol Vinyals; Alexander Graves; Nal Kalchbrenner; Andrew Senior; Koray Kavukcuoglu

Arxiv (2016)

Abstract

This paper introduces WaveNet, a deep generative neural network trained end-to-end to model raw audio waveforms, which can be applied to text-to-speech and music generation. Current approaches to text-to-speech focus on either non-parametric, example-based generation (which stitches together short audio signal segments from a large training set) or parametric, model-based generation (in which a model generates acoustic features that a vocoder then synthesizes into a waveform). In contrast, we show that directly generating wideband audio signals at tens of thousands of samples per second is not only feasible but also significantly outperforms the prior art. A single trained WaveNet can generate different voices by conditioning on the speaker identity. We also show that the same approach can be used for music audio generation and speech recognition.

Research Areas

Machine intelligence
