WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

Nanxin Chen; Yu Zhang; Heiga Zen (Byungha Chun); Ron J. Weiss; Mohammad Norouzi; Najim Dehak; William Chan

Interspeech (2021)

Abstract

This paper introduces WaveGrad 2, an end-to-end non-autoregressive generative model for text-to-speech synthesis trained to estimate the gradient of the log data density. Unlike recent TTS systems, which are cascades of separately learned models, the proposed model requires only a text or phoneme sequence during training, learns all parameters end-to-end without intermediate features, and can generate natural speech audio with great variety. This is achieved with a score matching objective, which optimizes the network to model the score function of the real data distribution. Output waveforms are generated by an iterative refinement process that begins from a random noise sample. Like our prior work, WaveGrad 2 offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps. Experiments reveal that the model can generate high-fidelity audio, closing the gap between end-to-end and contemporary cascaded systems and approaching the performance of a state-of-the-art neural TTS system. We further carry out various ablations to study the impact of different model configurations.
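The iterative refinement described above can be illustrated with a minimal NumPy sketch of a generic DDPM-style sampling loop. This is not the authors' implementation: the function names `refine` and `oracle_eps`, the linear noise schedule, and the toy sine "waveform" are all assumptions made for illustration. In place of a trained network conditioned on text, the sketch uses an analytic noise predictor that already knows the target signal, so the loop's mechanics can be checked in isolation.

```python
import numpy as np


def refine(eps_fn, shape, betas, rng):
    """DDPM-style iterative refinement (illustrative, not the paper's code).

    eps_fn(y, alpha_bar) is a hypothetical noise predictor standing in for
    a trained network; betas is an assumed noise schedule, one value per step.
    """
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    y = rng.standard_normal(shape)  # start from a random noise sample
    for n in range(len(betas) - 1, -1, -1):
        eps = eps_fn(y, alpha_bars[n])
        # Remove the predicted noise component for this step.
        y = (y - betas[n] / np.sqrt(1.0 - alpha_bars[n]) * eps) / np.sqrt(alphas[n])
        if n > 0:
            # Re-inject a smaller amount of noise on all but the final step.
            var = (1.0 - alpha_bars[n - 1]) / (1.0 - alpha_bars[n]) * betas[n]
            y = y + np.sqrt(var) * rng.standard_normal(shape)
    return y


# Toy target standing in for a speech waveform (assumption for the demo).
target = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))


def oracle_eps(y, alpha_bar):
    """Exact noise predictor for a data distribution concentrated on `target`."""
    return (y - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)


# The number of refinement steps is set by the length of the schedule.
betas = np.linspace(1e-4, 0.05, 50)  # arbitrary linear schedule
sample = refine(oracle_eps, target.shape, betas, np.random.default_rng(0))
```

Because the toy predictor is exact, the loop recovers the target for any schedule length. With a learned predictor, the abstract's point applies: a shorter schedule runs faster but typically yields lower sample quality.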

Research Areas

Machine intelligence
