Deep Mixture Density Networks for Acoustic Modeling in Statistical Parametric Speech Synthesis

Heiga Zen; Andrew Senior

Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2014), pp. 3872-3876

Abstract

Statistical parametric speech synthesis (SPSS) using deep neural networks (DNNs) has shown its potential to produce natural-sounding synthesized speech. However, the current implementation of DNN-based acoustic modeling for speech synthesis has limitations, such as the unimodal nature of its objective function and its inability to predict variances. To address these limitations, this paper investigates the use of a mixture density output layer, which can estimate full probability density functions over real-valued output features conditioned on the corresponding input features. Results from objective and subjective evaluations show that the mixture density output layer improves both the prediction accuracy of acoustic features and the naturalness of the synthesized speech.
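
To make the idea of a mixture density output layer concrete, the following is a minimal sketch for a single one-dimensional output feature. It is not taken from the paper. It assumes one common parameterisation (softmax for mixture weights, identity for means, exponential for standard deviations), and the function names `mdn_params` and `mdn_nll` are hypothetical. The point it illustrates is that the network emits the parameters of a Gaussian mixture rather than a single point estimate, so the training objective can be multimodal and the variances are predicted alongside the means.

```python
import math


def mdn_params(activations, num_mix):
    """Map 3 * num_mix raw activations to 1-D Gaussian mixture parameters.

    Assumed layout (illustrative only): [weight logits | means | log std devs].
    """
    assert len(activations) == 3 * num_mix
    z_w = activations[:num_mix]
    z_mu = activations[num_mix:2 * num_mix]
    z_sigma = activations[2 * num_mix:]

    # Softmax keeps the weights positive and summing to one;
    # subtracting the maximum avoids overflow in exp().
    shift = max(z_w)
    exp_w = [math.exp(z - shift) for z in z_w]
    total = sum(exp_w)
    weights = [e / total for e in exp_w]

    means = list(z_mu)                        # identity: means are unconstrained
    sigmas = [math.exp(z) for z in z_sigma]   # exp keeps std devs positive
    return weights, means, sigmas


def mdn_nll(y, weights, means, sigmas):
    """Negative log-likelihood of target y under the predicted mixture."""
    log_terms = [
        math.log(w) - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        - 0.5 * ((y - mu) / s) ** 2
        for w, mu, s in zip(weights, means, sigmas)
    ]
    # Log-sum-exp over components for numerical stability.
    shift = max(log_terms)
    return -(shift + math.log(sum(math.exp(t - shift) for t in log_terms)))


# Two equally weighted components centred at -1 and +1 with unit variance.
weights, means, sigmas = mdn_params([0.0, 0.0, -1.0, 1.0, 0.0, 0.0], num_mix=2)
loss = mdn_nll(0.5, weights, means, sigmas)
```

With a single component, this loss is the Gaussian negative log-likelihood with a predicted variance. If that variance is held fixed, minimising it is equivalent to minimising squared error, which is the unimodal objective the abstract identifies as a limitation.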
