Google Research

Parallel Tacotron: Non-Autoregressive and Controllable TTS

ICASSP (2021)

Abstract

Although neural end-to-end text-to-speech models can synthesizehighly natural speech, there is still a room for improvements in itsefficiency during inference. This paper proposes a non-autoregressiveneural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, calledParallel Tacotron, is highlyparallelizable during both training and inference, allowing efficientsynthesis on modern parallel hardware. The use of the variationalautoencoder helps to relax the one-to-many mapping nature of thetext-to-speech problem. To further improve the naturalness, weintroduce an iterative spectrogram loss, which is inspired by iterativerefinement, and lightweight convolution, which can efficiently capturelocal contexts. Experimental results show that Parallel Tacotronmatches a strong autoregressive baseline in subjective naturalnesswith significantly decreased inference time.

Research Areas

Learn more about how we do research

We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work