Hierarchical Generative Modeling for Controllable Speech Synthesis

Wei-Ning Hsu; Yu Zhang; Ron Weiss; Heiga Zen; Yonghui Wu; Yuxuan Wang; Yuan Cao; Ye Jia; Zhifeng Chen; Jonathan Shen; Patrick Nguyen; Ruoming Pang

International Conference on Learning Representations (2019)

Abstract

This paper proposes a neural end-to-end text-to-speech model that can control latent attributes of the generated speech that are rarely annotated in the training data (e.g., speaking style, accent, background noise level, and recording conditions). The model is formulated as a conditional generative model with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g., clean/noisy) and provides interpretability. The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g., noise level, speaking rate) and enables disentangled, fine-grained control over these attributes. Together, the two levels amount to using a Gaussian mixture model (GMM) for the latent distribution. Extensive evaluation of the proposed model demonstrates its ability to control the aforementioned attributes. In particular, the model consistently synthesizes high-quality clean speech regardless of the quality of the training data for the target speaker.
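
The two-level latent structure described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the two-component toy prior, the 2-D latent size, and the diagonal covariance are all assumptions made for illustration. The sketch draws a categorical attribute group y, then a Gaussian latent z conditioned on y, and evaluates the resulting mixture density log p(z) = log sum_y p(y) N(z; mu_y, diag(sigma_y^2)).

```python
import math
import random


def sample_latent(weights, means, stddevs, rng):
    """Draw y ~ Categorical(weights), then z ~ N(means[y], diag(stddevs[y]^2))."""
    y = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    z = [rng.gauss(m, s) for m, s in zip(means[y], stddevs[y])]
    return y, z


def gmm_log_density(z, weights, means, stddevs):
    """Mixture log-density of z, computed with log-sum-exp for stability."""
    log_terms = []
    for w, mu, sd in zip(weights, means, stddevs):
        # Log-density of a diagonal Gaussian is a sum over dimensions.
        log_n = sum(
            -0.5 * math.log(2 * math.pi) - math.log(s) - 0.5 * ((v - m) / s) ** 2
            for v, m, s in zip(z, mu, sd)
        )
        log_terms.append(math.log(w) + log_n)
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


# Hypothetical prior: two attribute groups (e.g. "clean" and "noisy")
# over a 2-D latent space. All values are made up for illustration.
WEIGHTS = [0.5, 0.5]
MEANS = [[-2.0, 0.0], [2.0, 0.0]]
STDDEVS = [[0.5, 0.5], [0.5, 0.5]]

rng = random.Random(0)
y, z = sample_latent(WEIGHTS, MEANS, STDDEVS, rng)
print("attribute group:", y)
print("latent vector:", z)
print("log p(z):", gmm_log_density(z, WEIGHTS, MEANS, STDDEVS))
```

Under these assumptions, choosing y selects an attribute group, while moving z along one dimension within the chosen component corresponds to the fine-grained control the abstract describes. In the model itself, the synthesizer would additionally be conditioned on the input text and on z.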

Research Areas

Machine intelligence
