Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Alignments

Isaac Elias
Jonathan Shen
Yu Zhang
Ye Jia
(2021)

Abstract

We present a state-of-the-art non-autoregressive text-to-speech model. The model, called Parallel Tacotron 2, learns to synthesize high-quality speech without supervised duration signals or other assumptions about the token-to-frame mapping. Specifically, we introduce a novel learned attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time Warping. We show that this new unsupervised model outperforms the baselines in naturalness in several diverse multi-speaker evaluations. Furthermore, the explicit duration model that it learns can be used to control the synthesized speech.
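To make the Soft Dynamic Time Warping component of the reconstruction loss concrete, the sketch below shows the standard soft-DTW recursion (a differentiable soft minimum over the three alignment moves). This is only an illustrative NumPy sketch, not the paper's implementation; the function names, the squared-error frame distance, and the default `gamma` are assumptions made for the example.

```python
import numpy as np

def soft_min(values, gamma):
    """Differentiable soft minimum: -gamma * logsumexp(-values / gamma)."""
    z = -np.asarray(values) / gamma
    m = z.max()
    return -gamma * (m + np.log(np.exp(z - m).sum()))

def soft_dtw(pred, target, gamma=0.1):
    """Soft-DTW alignment cost between predicted and reference frames.

    pred:   (T1, D) predicted mel-spectrogram frames (assumed shape)
    target: (T2, D) reference mel-spectrogram frames (assumed shape)
    """
    T1, T2 = len(pred), len(target)
    R = np.full((T1 + 1, T2 + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            # Squared-error distance between a predicted and a reference frame
            cost = np.sum((pred[i - 1] - target[j - 1]) ** 2)
            # Soft minimum over insertion, deletion, and match moves
            R[i, j] = cost + soft_min(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma
            )
    return R[T1, T2]

# Example usage with random 80-dim mel frames of different lengths:
# loss = soft_dtw(np.random.randn(50, 80), np.random.randn(55, 80))
```

Because the soft minimum is smooth, this cost is differentiable with respect to the predicted frames, which is what allows such a loss to be used without a fixed token-to-frame alignment.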

Research Areas