Fully-hierarchical Fine-grained Prosody Modeling for Interpretable Speech Synthesis

Guangzhi Sun; Yu Zhang; Ron J. Weiss; Yuan Cao; Heiga Zen; Yonghui Wu

ICASSP (2020)

Abstract

We propose a hierarchical, fine-grained, and interpretable latent model for prosody based on Tacotron 2. The model achieves multi-resolution modeling by conditioning finer-level prosody representations on coarser-level ones. The hierarchical conditioning is also imposed across all latent dimensions using a conditional VAE with an auto-regressive structure. Reconstruction performance is evaluated with the $F_0$ frame error (FFE) and the mel-cepstral distortion (MCD), which show that the new structure does not degrade the model. Interpretations of prosody attributes are provided, together with a comparison between word-level and phone-level prosody representations. Moreover, both qualitative and quantitative evaluations are used to demonstrate improved disentanglement of the latent dimensions.
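
For readers unfamiliar with the two reconstruction metrics named in the abstract, the following is a minimal sketch of how FFE and MCD are commonly computed. It is not taken from the paper: the function names, the 20% pitch tolerance, the convention that an $F_0$ of 0 marks an unvoiced frame, and the choice to skip the 0th cepstral coefficient are all assumptions, and the inputs are assumed to be already time-aligned frame by frame.

```python
import math


def mel_cepstral_distortion(ref, syn):
    """Mean MCD in dB over time-aligned frames.

    ref, syn: lists of frames, each a list of mel-cepstral coefficients.
    Assumption: coefficient 0 (energy) is excluded, a common convention.
    """
    if len(ref) != len(syn) or not ref:
        raise ValueError("ref and syn must be non-empty and the same length")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for r, s in zip(ref, syn):
        # Euclidean distance between the two frames, skipping coefficient 0.
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(r[1:], s[1:])))
    return scale * total / len(ref)


def f0_frame_error(ref_f0, syn_f0, tol=0.2):
    """Fraction of frames with a voicing error or a gross pitch error.

    Assumptions: f0 == 0 marks an unvoiced frame, and a pitch error is
    "gross" when it deviates from the reference by more than tol (20%).
    """
    if len(ref_f0) != len(syn_f0) or not ref_f0:
        raise ValueError("inputs must be non-empty and the same length")
    errors = 0
    for r, s in zip(ref_f0, syn_f0):
        r_voiced, s_voiced = r > 0, s > 0
        if r_voiced != s_voiced:
            errors += 1  # voicing decision error
        elif r_voiced and abs(s - r) > tol * r:
            errors += 1  # both voiced, but pitch is off by more than tol
    return errors / len(ref_f0)


if __name__ == "__main__":
    # Toy values for illustration only, not data from the paper.
    print(f0_frame_error([100.0, 0.0, 200.0, 150.0], [100.0, 50.0, 250.0, 150.0]))
    print(mel_cepstral_distortion([[0.0, 1.0, 2.0]], [[0.0, 1.0, 2.0]]))
```

Lower values are better for both metrics. In practice the reference and synthesized sequences usually need to be aligned first (for example with dynamic time warping), which this sketch leaves out.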

Research Areas

Machine intelligence
