WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

Nanxin Chen
Yu Zhang
Mohammad Norouzi
Najim Dehak
William Chan
Interspeech (2021)

Abstract

This paper introduces WaveGrad 2, an end-to-end non-autoregressive generative model for text-to-speech synthesis trained to estimate the gradients of the data density. Unlike recent TTS systems, which are cascades of separately learned models, the proposed model requires only a text or phoneme sequence during training, learns all parameters end-to-end without intermediate features, and can generate natural speech audio with rich variation. This is achieved with a score matching objective, which optimizes the network to model the score function of the real data distribution. Output waveforms are generated through an iterative refinement process that begins from a random noise sample. As in our prior work, WaveGrad 2 offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps. Experiments reveal that the model can generate high-fidelity audio, closing the gap between end-to-end and cascaded systems and approaching the performance of a state-of-the-art neural TTS system. We further carry out various ablations to study the impact of different model configurations.
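For concreteness, the score function mentioned above is the gradient of the log data density, and a denoising score matching objective of the kind used in the WaveGrad family can be written, in standard DDPM notation (an illustration, not a transcription of the paper's equations), as

\[
s_\theta(\tilde{y}, x) \approx \nabla_{\tilde{y}} \log p(\tilde{y} \mid x),
\qquad
\mathbb{E}_{y_0,\; \epsilon \sim \mathcal{N}(0, I),\; n}
\left\| \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_n}\, y_0 + \sqrt{1 - \bar{\alpha}_n}\, \epsilon,\; x,\; \sqrt{\bar{\alpha}_n}\right) - \epsilon \right\|_1,
\]

where \(y_0\) is the clean waveform, \(x\) the conditioning phoneme sequence, \(\epsilon\) the injected Gaussian noise, and \(\bar{\alpha}_n\) the cumulative product of the noise schedule at step \(n\).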
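The iterative refinement procedure and the speed/quality trade-off can be sketched as a standard DDPM-style ancestral sampler. This is a minimal illustration under assumed names: `score_model`, `text_features`, and `betas` (a NumPy array holding the noise schedule) are hypothetical, not the paper's API.

```python
import numpy as np

def wavegrad2_sample(score_model, text_features, betas, length, seed=0):
    """Hypothetical DDPM-style sampler sketch for a WaveGrad-like model.

    Assumes score_model(y, text_features, noise_level) predicts the Gaussian
    noise epsilon mixed into the waveform at the given continuous noise level.
    """
    rng = np.random.default_rng(seed)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    y = rng.standard_normal(length)  # begin from pure Gaussian noise
    for n in reversed(range(len(betas))):
        eps = score_model(y, text_features, np.sqrt(alpha_bars[n]))
        # Posterior mean: subtract the predicted noise, then rescale.
        y = (y - (1.0 - alphas[n]) / np.sqrt(1.0 - alpha_bars[n]) * eps) \
            / np.sqrt(alphas[n])
        if n > 0:
            # Re-inject scheduled noise on all but the final step
            # (sigma_n = sqrt(beta_n) is one common choice).
            y += np.sqrt(betas[n]) * rng.standard_normal(length)
    return y
```

Shortening `betas`, for example using a tuned few-step schedule instead of a thousand steps, reduces inference cost at some cost in fidelity, which is the speed-versus-quality knob the abstract describes.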