Generating diverse and natural text-to-speech samples using quantized fine-grained VAE and autoregressive prosody prior

Abstract

Recently proposed approaches to fine-grained prosody modeling in end-to-end text-to-speech (TTS) enable precise control over the prosody of synthesized speech.
Such models incorporate a fine-grained variational autoencoder (VAE) structure into a sequence-to-sequence model, extracting latent prosody features for each input token (e.g., phoneme).
Generating samples using the standard VAE prior, an independent Gaussian at each time step, results in very unnatural and discontinuous speech, with dramatic variation between phonemes.
In this paper we propose a sequential prior in the discrete latent space which can be used to generate more natural samples.
This is accomplished by discretizing the latent prosody features using vector quantization, and training an autoregressive (AR) prior model over the result.
The AR prior is learned separately from the training of the posterior.
We evaluate the approach using subjective listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes, including volume, pitch, and phoneme duration.
Compared to the fine-grained VAE baseline, the proposed model achieves equally good copy synthesis reconstruction performance, but significantly improves naturalness in sample generation.
The diversity of the prosody in random samples better matches that of the real speech.
Furthermore, initial experiments demonstrate that samples generated from the quantized latent space can be used as an effective data augmentation strategy to improve ASR performance.
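
As a rough illustration of the two ingredients described above, the sketch below shows (1) vector quantization of per-phoneme latent prosody vectors against a codebook and (2) sampling a code sequence from a sequential prior, where each step is conditioned on the previous code instead of being drawn independently. This is a minimal NumPy sketch under illustrative assumptions: the sizes, variable names, and the first-order transition model are stand-ins, not the paper's learned encoder, codebook, or autoregressive prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's configuration).
num_phonemes = 12      # length of the input token sequence
latent_dim = 3         # dimensionality of each per-phoneme prosody latent
codebook_size = 8      # number of discrete prosody codes

# Stand-ins for quantities produced by training: per-phoneme latents from the
# fine-grained VAE encoder, and the vector-quantization codebook.
latents = rng.normal(size=(num_phonemes, latent_dim))
codebook = rng.normal(size=(codebook_size, latent_dim))

def quantize(z, codebook):
    """Vector quantization: map each latent vector to its nearest codebook entry."""
    # Squared Euclidean distance from every latent to every code: (tokens, codes).
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d.argmin(axis=-1)          # one discrete code per phoneme
    return indices, codebook[indices]    # indices and quantized latents

indices, quantized = quantize(latents, codebook)

# Stand-in for the separately trained autoregressive prior: here just a
# first-order transition matrix over codes; the paper's AR prior is a learned
# sequence model over the discrete latents.
transitions = rng.dirichlet(np.ones(codebook_size), size=codebook_size)
initial = np.full(codebook_size, 1.0 / codebook_size)

def sample_sequential_prior(length):
    """Sample codes step by step, each conditioned on the previous code."""
    codes = [rng.choice(codebook_size, p=initial)]
    for _ in range(length - 1):
        codes.append(rng.choice(codebook_size, p=transitions[codes[-1]]))
    return np.array(codes)

sampled_codes = sample_sequential_prior(num_phonemes)
sampled_latents = codebook[sampled_codes]  # would be fed to the decoder in place of posterior samples
print(indices, sampled_codes)
```

The contrast with the standard VAE prior is that an independent Gaussian draw at each phoneme ignores the previous steps entirely, whereas the sequential prior makes each prosody code depend on its history, which is what yields smoother, more natural sampled prosody.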

Research Areas