Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis

Ron J. Weiss; RJ Skerry-Ryan; Eric Battenberg; Soroosh Mariooryad; Diederik P. Kingma

ICASSP (2021)

Abstract

We describe a sequence-to-sequence neural network which can directly generate speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow in the decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length frames, each one containing hundreds of samples. The inter-dependencies of waveform samples within each frame are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on its preceding frames. The model allows for straightforward optimization towards the maximum likelihood objective, without using intermediate spectral features or additional loss terms. Contemporary state-of-the-art TTS systems use a sequence of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed by a vocoder model (such as WaveRNN) which generates waveform samples from the intermediate features. The proposed system, in contrast, does not use a fixed intermediate representation and learns all parameters end-to-end. We demonstrate (to the best of our knowledge) the first system in the literature to do so successfully. Experiments show that the quality of speech generated from the proposed model is nearly competitive with state-of-the-art neural TTS methods, with significantly improved generation speed.
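
The framing and likelihood idea in the abstract can be illustrated with a toy sketch. This is not the paper's architecture. It assumes a single elementwise affine flow, a hand-written `conditioner` that looks only at the one previous frame (the abstract conditions on preceding frames and on text through the Tacotron decoder), and a frame length of 4 instead of hundreds of samples. All function names and constants below are hypothetical, chosen only to show how non-overlapping frames, an invertible per-frame transform, and autoregression across frames could combine into an exact log-likelihood.

```python
import numpy as np

FRAME_LEN = 4  # toy value (assumption); the abstract says real frames hold hundreds of samples


def frame_waveform(wave, frame_len=FRAME_LEN):
    """Split a 1-D waveform into non-overlapping fixed-length frames (drops any remainder)."""
    n_frames = len(wave) // frame_len
    return wave[: n_frames * frame_len].reshape(n_frames, frame_len)


def conditioner(prev_frame):
    """Hypothetical stand-in for the decoder: maps the previous frame to flow parameters."""
    shift = 0.5 * prev_frame
    log_scale = -0.1 * np.abs(prev_frame)
    return shift, log_scale


def flow_forward(frame, prev_frame):
    """Map a waveform frame x to noise z; also return log|det dz/dx|."""
    shift, log_scale = conditioner(prev_frame)
    z = (frame - shift) * np.exp(-log_scale)
    return z, -np.sum(log_scale)


def flow_inverse(z, prev_frame):
    """Map noise z back to a waveform frame (exact inverse of flow_forward)."""
    shift, log_scale = conditioner(prev_frame)
    return z * np.exp(log_scale) + shift


def log_likelihood(frames):
    """Exact log-likelihood: standard-normal prior on z plus the change-of-variables term."""
    total = 0.0
    prev = np.zeros(frames.shape[1])  # all-zero frame stands in for "start of sequence"
    for frame in frames:
        z, log_det = flow_forward(frame, prev)
        log_pz = -0.5 * np.sum(z**2 + np.log(2 * np.pi))
        total += log_pz + log_det
        prev = frame  # training conditions on the observed previous frame
    return total


def sample(n_frames, frame_len=FRAME_LEN, seed=0):
    """Generate frame by frame: all samples within a frame are produced in one step."""
    rng = np.random.default_rng(seed)
    prev = np.zeros(frame_len)
    out = []
    for _ in range(n_frames):
        z = rng.standard_normal(frame_len)
        prev = flow_inverse(z, prev)  # generation conditions on the generated previous frame
        out.append(prev)
    return np.concatenate(out)
```

In this sketch, maximizing `log_likelihood` over the conditioner's parameters would be the only training objective, which mirrors the abstract's point about needing no additional loss terms. The loop in `sample` runs once per frame rather than once per waveform sample, which is where a frame-level model could gain generation speed over sample-level autoregression.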

Research Areas

Machine intelligence
