Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Alignments

Isaac Elias

Heiga Zen

Jonathan Shen

Yu Zhang

Ye Jia

RJ Skerry-Ryan

Yonghui Wu

(2021)


Abstract

We present a state-of-the-art non-autoregressive text-to-speech model. The model, called Parallel Tacotron 2, learns to synthesize speech of good quality without supervised duration signals or other assumptions about the token-to-frame mapping. Specifically, we introduce a novel learned attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time Warping. We show that this new unsupervised model outperforms the baselines in naturalness in several diverse multi-speaker evaluations. Further, we show that the explicit duration model it has learned can be used to control the synthesized speech.
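To make the Soft Dynamic Time Warping idea concrete, the sketch below computes the generic soft-DTW cost between two sequences of feature frames using the standard dynamic-programming recursion with a smoothed minimum. It is an illustration only, not the paper's loss: the function names `soft_min` and `soft_dtw`, the L1 frame distance, and the default `gamma` are all assumptions chosen for this example, and the paper's iterative reconstruction scheme is not reproduced here.

```python
import math


def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma)))."""
    # Shift by the true minimum so the exponentials cannot overflow.
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(pred, target, gamma=0.1):
    """Soft-DTW cost between two sequences of frames (lists of float lists).

    The sequences may have different lengths. Smaller gamma brings the
    result closer to the classic (hard) DTW cost.
    """
    n, m = len(pred), len(target)
    inf = float("inf")
    # r[i][j] is the soft-minimal cost of aligning pred[:i] with target[:j].
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # L1 distance between frames (an assumption for this sketch).
            d = sum(abs(a - b) for a, b in zip(pred[i - 1], target[j - 1]))
            r[i][j] = d + soft_min(
                [r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]], gamma
            )
    return r[n][m]


if __name__ == "__main__":
    predicted = [[0.0], [1.0]]
    reference = [[0.0], [0.0], [1.0], [1.0]]
    # The two sequences differ in length but can be warped onto each other.
    print(soft_dtw(predicted, reference, gamma=0.01))
```

Because every operation in the recursion is smooth, the cost is differentiable with respect to the predicted frames, which is the property that lets a loss of this kind be used when predicted and target sequences are not aligned frame by frame.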

Research Areas

Speech Processing
