Fully-Hierarchical Fine-Grained Prosody Modeling for Interpretable Speech Synthesis
Abstract
We propose a hierarchical, fine-grained, and interpretable latent variable model for prosody based on Tacotron~2. The model achieves multi-resolution modeling by conditioning finer-level prosody representations on coarser-level ones. In addition, hierarchical conditioning is imposed across all latent dimensions using a conditional variational auto-encoder (VAE) with an auto-regressive structure. Reconstruction performance is evaluated with the $F_0$ frame error (FFE) and the mel-cepstral distortion (MCD), which indicate that the new structure does not degrade the model. Interpretations of the prosody attributes are provided, together with a comparison between word-level and phone-level prosody representations. Moreover, both qualitative and quantitative evaluations are used to demonstrate the improvement in the disentanglement of the latent dimensions.
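To make the two conditioning ideas in the abstract concrete, the following is a minimal sketch, assuming PyTorch; the class and variable names (e.g. `AutoRegressiveLatent`, `word_latent`) are illustrative and not taken from the paper. It shows a fine (phone-level) latent conditioned on a coarse (word-level) latent, and an auto-regressive conditional posterior in which each latent dimension is sampled given all previously sampled dimensions.

```python
# Hedged sketch: hierarchical, auto-regressive conditional latent (not the paper's code).
import torch
import torch.nn as nn

class AutoRegressiveLatent(nn.Module):
    def __init__(self, cond_dim: int, latent_dim: int, hidden: int = 64):
        super().__init__()
        self.latent_dim = latent_dim
        # One recurrent step per latent dimension; input is the previous sample.
        self.cell = nn.GRUCell(1, hidden)
        self.init_state = nn.Linear(cond_dim, hidden)   # coarse-level conditioning
        self.to_mu = nn.Linear(hidden, 1)
        self.to_logvar = nn.Linear(hidden, 1)

    def forward(self, cond: torch.Tensor):
        """cond: [batch, cond_dim] coarser-level latent (e.g. word-level)."""
        batch = cond.size(0)
        h = torch.tanh(self.init_state(cond))
        prev = cond.new_zeros(batch, 1)
        samples, mus, logvars = [], [], []
        for _ in range(self.latent_dim):
            h = self.cell(prev, h)
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z_i = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
            samples.append(z_i); mus.append(mu); logvars.append(logvar)
            prev = z_i          # condition the next dimension on this sample
        return (torch.cat(samples, dim=-1),
                torch.cat(mus, dim=-1),
                torch.cat(logvars, dim=-1))

# Word-level latent conditions the phone-level latent (hierarchy across levels).
word_latent = torch.randn(8, 3)   # hypothetical 3-dim word-level prosody code
phone_posterior = AutoRegressiveLatent(cond_dim=3, latent_dim=3)
z_phone, mu, logvar = phone_posterior(word_latent)
```

In this sketch the KL term of the VAE objective would be computed per dimension from `mu` and `logvar`; dimension sizes and the recurrent cell are placeholder choices.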