Parallel Tacotron: Non-Autoregressive and Controllable TTS
Abstract
Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvement in their inference efficiency. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder helps to relax the one-to-many mapping nature of the text-to-speech problem. To further improve naturalness, we introduce an iterative spectrogram loss, inspired by iterative refinement, and lightweight convolution, which can efficiently capture local contexts. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective naturalness with significantly decreased inference time.
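To make the iterative spectrogram loss concrete, the minimal Python/NumPy sketch below applies an L1 penalty to the spectrogram predicted after each decoder block, so every intermediate prediction, not just the final one, is pulled toward the target. The function name, the plain L1 formulation, and the assumed interface (one prediction per block) are illustrative assumptions, not the paper's code.

    import numpy as np

    def iterative_spectrogram_loss(predictions, target):
        """Sum of L1 losses over the mel-spectrogram predicted by each
        decoder block (hypothetical interface; arrays of shape (frames, mel_bins))."""
        return sum(float(np.mean(np.abs(pred - target))) for pred in predictions)

    # Usage sketch: three decoder blocks, each refining a 100-frame, 80-bin spectrogram.
    target = np.random.randn(100, 80)
    predictions = [target + 0.1 * np.random.randn(100, 80) for _ in range(3)]
    print(iterative_spectrogram_loss(predictions, target))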