Google Research

Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Alignments

(2021)

Abstract

We present a state-of-the-art non-autoregressive Text-To-Speech model. The model, called Parallel Tacotron 2, learns to synthesize high-quality speech without supervised duration signals or other assumptions about the token-to-frame mapping. Specifically, we introduce a novel learned attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time Warping. We show that this unsupervised model outperforms the baselines in naturalness across several diverse multi-speaker evaluations. Further, we show that the explicit duration model it learns can be used to control the synthesized speech.
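To make the Soft Dynamic Time Warping idea concrete, here is a minimal sketch of a Soft-DTW distance in NumPy. This is not the paper's implementation (the paper uses an iterative variant inside a neural training loop); it only illustrates the core recurrence, in which the hard minimum of classical DTW is replaced by a differentiable soft minimum controlled by a smoothing parameter `gamma`. All function and variable names here are illustrative.

```python
import numpy as np

def soft_min(values, gamma):
    """Differentiable soft minimum: -gamma * log(sum(exp(-v / gamma)))."""
    z = -np.asarray(values) / gamma
    m = z.max()  # log-sum-exp trick for numerical stability
    return -gamma * (m + np.log(np.exp(z - m).sum()))

def soft_dtw(X, Y, gamma=0.1):
    """Soft-DTW distance between sequences X (n, d) and Y (m, d).

    Uses the standard DTW recurrence with the hard min replaced by
    soft_min, so the result is differentiable in the frame values.
    """
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.sum((X[i - 1] - Y[j - 1]) ** 2)  # squared Euclidean frame cost
            D[i, j] = cost + soft_min(
                [D[i - 1, j], D[i, j - 1], D[i - 1, j - 1]], gamma
            )
    return D[n, m]
```

As `gamma` approaches zero the soft minimum approaches the hard minimum and the value recovers classical DTW; larger `gamma` gives a smoother, easier-to-optimize loss surface. In a TTS reconstruction loss, `X` and `Y` would be predicted and target spectrogram frames, so no externally supplied token-to-frame alignment is required.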
