Google Research

Fully-hierarchical Fine-grained Prosody Modeling for Interpretable speech synthesis

ICASSP (2020)


We propose a hierarchical, fine-grained and interpretable latent model for prosody based on the Tacotron~2. This model achieves multi-resolution modeling by conditioning finer level prosody representations on coarser level ones. In addition, the hierarchical conditioning is also imposed across all latent dimensions using a conditional VAE structure which exploits an auto-regressive structure. Reconstruction performance is evaluated with the $F_0$ frame error (FFE) and the mel-cepstral distortion (MCD) which illustrates the new structure does not degrade the model. Interpretations of prosody attributes are provided together with the comparison between word-level and phone-level prosody representations. Moreover, both qualitative and quantitative evaluations are used to demonstrate the improvement in the disentanglement of the latent dimensions.

Learn more about how we do research

We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work