- Isaac Elias
- Heiga Zen
- Jonathan Shen
- Yu Zhang
- Ye Jia
- Ron J. Weiss
- Yonghui Wu
Abstract
Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvement in their efficiency during inference. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder helps to relax the one-to-many mapping nature of the text-to-speech problem. To further improve naturalness, we introduce an iterative spectrogram loss, inspired by iterative refinement, and lightweight convolution, which can efficiently capture local contexts. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective naturalness with significantly decreased inference time.
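The abstract names two concrete ingredients: lightweight convolution (Wu et al., 2019) for capturing local context, and an iterative spectrogram loss in which each decoder block emits its own spectrogram estimate. The PyTorch sketch below illustrates both under stated assumptions; it is a minimal illustration of the cited techniques, not the paper's implementation, and the function names, tensor shapes, and head-grouping convention are assumptions.

```python
import torch
import torch.nn.functional as F


def lightweight_conv(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Lightweight convolution: a depthwise 1-D convolution whose kernel is
    softmax-normalized over the kernel width and shared across groups of
    channels ("heads"), giving far fewer parameters than a full convolution.

    x:      (batch, channels, time); channels must be divisible by num_heads
    weight: (num_heads, 1, kernel_size) raw parameters; kernel_size odd
    """
    batch, channels, time = x.shape
    num_heads, _, kernel_size = weight.shape
    # Normalize each head's kernel taps so they sum to one.
    weight = F.softmax(weight, dim=-1)
    # Fold channels-per-head into the batch dimension so a grouped conv1d
    # applies one shared kernel per head.
    x = x.reshape(-1, num_heads, time)
    out = F.conv1d(x, weight, padding=kernel_size // 2, groups=num_heads)
    return out.reshape(batch, channels, time)


def iterative_spectrogram_loss(predictions, target):
    """Iterative spectrogram loss: sum an L1 loss between the ground-truth
    mel spectrogram and the estimate produced after every decoder block, so
    later blocks are trained to refine earlier estimates.

    predictions: list of (batch, frames, mel_bins) tensors, one per block
    target:      (batch, frames, mel_bins) ground-truth mel spectrogram
    """
    return sum(F.l1_loss(p, target) for p in predictions)
```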