Generating diverse and natural text-to-speech samples using quantized fine-grained VAE and autoregressive prosody prior

Abstract

Recently proposed approaches to fine-grained prosody control in end-to-end text-to-speech (TTS) enable precise control of the prosody of synthesized speech. Such models incorporate a fine-grained variational autoencoder (VAE) structure into a sequence-to-sequence model, extracting latent prosody features for each input token (e.g., phonemes). Generating samples using the standard VAE prior, an independent Gaussian at each time step, results in very unnatural and discontinuous speech, with dramatic variation between phonemes. In this paper we propose a sequential prior in a discrete latent space which can be used to generate more natural samples. This is accomplished by discretizing the latent prosody features using vector quantization and training an autoregressive (AR) prior model over the result; the AR prior is learned separately from the posterior. We evaluate the approach using subjective listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes including volume, pitch, and phoneme duration. Compared to the fine-grained VAE baseline, the proposed model achieves equally good copy-synthesis reconstruction performance but significantly improves naturalness in sample generation, and the diversity of the prosody in random samples better matches that of real speech. Furthermore, initial experiments demonstrate that samples generated from the quantized latent space can be used as an effective data augmentation strategy to improve ASR performance.
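To make the quantization step concrete, the sketch below illustrates the general idea of vector-quantizing per-phoneme latent vectors against a codebook; it is not the paper's implementation, and the codebook size, latent dimensionality, and variable names are illustrative assumptions. In the actual model the codebook is learned jointly with the fine-grained VAE, and the AR prior is then fit over the resulting discrete index sequences.

```python
import numpy as np

# Minimal sketch (assumed sizes, not from the paper): map each per-phoneme
# latent prosody vector to its nearest codebook entry, VQ-VAE style.
rng = np.random.default_rng(0)

num_codes, latent_dim = 256, 3                      # illustrative values
codebook = rng.normal(size=(num_codes, latent_dim)) # learned in the real model

def quantize(latents):
    """Return (code indices, quantized latents) for shape (num_phonemes, latent_dim)."""
    # Squared Euclidean distance from every latent to every codebook vector.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

# Example: a 10-phoneme utterance with latents from the VAE encoder (random here).
latents = rng.normal(size=(10, latent_dim))
indices, quantized = quantize(latents)
print(indices)

# An autoregressive prior would then be trained on these index sequences, so that
# sampling p(z_t | z_<t) yields smoothly varying prosody across phonemes rather
# than independent draws at each time step.
```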

Research Areas