Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis
Abstract
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length frames, each one containing hundreds of samples. The interdependencies of waveform samples within each frame are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on the preceding frames. The model can be optimized directly toward the maximum likelihood objective, without using intermediate spectral features or additional loss terms. Contemporary state-of-the-art TTS systems use a cascade of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed by a vocoder model (such as WaveRNN) which generates waveform samples from the intermediate features. The proposed system, in contrast, does not use a fixed intermediate representation, and learns all parameters end-to-end. To the best of our knowledge, it is the first system in the literature to do so successfully. Experiments show that the quality of speech generated by the proposed model is nearly competitive with state-of-the-art neural TTS methods, with significantly improved generation speed.
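To make the decoder loop described above concrete, the following is a minimal, illustrative sketch and not the authors' implementation: a single conditional affine transform stands in for the normalizing flow, mapping a Gaussian latent vector to one fixed-length waveform frame, conditioned on a hypothetical text attention context and a summary of the previously generated frame. All names, dimensions, and helper functions here are assumptions made for illustration.

```python
import numpy as np

FRAME_LEN = 960   # samples per non-overlapping frame (e.g., 40 ms at 24 kHz; illustrative)
COND_DIM = 64     # size of the conditioning vector (text context + previous-frame summary)

rng = np.random.default_rng(0)

# Hypothetical stand-in for a learned conditioning network: it maps the conditioning
# vector to per-sample scale and shift parameters of one affine "flow" step.
# A real normalizing flow would stack many invertible layers instead of this single map.
W_scale = rng.normal(scale=0.01, size=(COND_DIM, FRAME_LEN))
W_shift = rng.normal(scale=0.01, size=(COND_DIM, FRAME_LEN))

def flow_inverse(noise, cond):
    """Generative direction of the flow: map latent noise z to a waveform frame x.

    x = z * scale(cond) + shift(cond). Training would use the forward direction
    plus the log-determinant of the Jacobian to evaluate exact frame likelihoods.
    """
    scale = np.exp(cond @ W_scale)   # strictly positive scales keep the map invertible
    shift = np.tanh(cond @ W_shift)
    return noise * scale + shift

def summarize(frame, dim=COND_DIM):
    """Crude fixed-size summary of the previous frame (placeholder for a learned pre-net)."""
    return np.resize(frame, dim)

def synthesize(text_contexts):
    """Autoregressive decoder loop: one fixed-length waveform frame per decoder step."""
    prev_frame = np.zeros(FRAME_LEN)
    frames = []
    for context in text_contexts:            # one attention context vector per step
        cond = context + summarize(prev_frame)
        z = rng.standard_normal(FRAME_LEN)   # sample the flow's latent prior
        frame = flow_inverse(z, cond)        # all samples within the frame generated in parallel
        frames.append(frame)
        prev_frame = frame                   # longer-term dependencies via autoregression
    return np.concatenate(frames)

# Usage: 10 decoder steps of dummy attention contexts -> ~0.4 s of (untrained) audio.
waveform = synthesize([rng.standard_normal(COND_DIM) for _ in range(10)])
print(waveform.shape)   # (9600,)
```

The sketch highlights the two time scales in the model: within a frame, all samples are produced in one parallel pass through the (here, trivial) flow; across frames, dependencies are captured autoregressively by feeding the previous frame back into the conditioning.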