Statistical Parametric Speech Synthesis Using Deep Neural Networks

Heiga Zen; Andrew Senior; Mike Schuster

Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE (2013), pp. 7962-7966

Abstract

Conventional approaches to statistical parametric speech synthesis typically use decision tree-clustered context-dependent hidden Markov models (HMMs) to represent probability densities of speech parameters given texts. Speech parameters are generated from the probability densities so as to maximize their output probabilities, and a speech waveform is then reconstructed from the generated parameters. This approach is reasonably effective but has some limitations; for example, decision trees are inefficient at modeling complex context dependencies. This paper examines an alternative scheme based on a deep neural network (DNN), in which the relationship between input texts and their acoustic realizations is modeled by the DNN. Using a DNN can address some limitations of the conventional approach. Experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.
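
As a rough illustration of the idea in the abstract, the sketch below shows a small feed-forward network that maps a per-frame vector of linguistic (text-derived) features to a vector of acoustic parameters. It is a minimal sketch under stated assumptions, not the architecture from the paper: the function names, the layer sizes, the tanh activation, and the feature dimensions are all hypothetical, and the weights are random and untrained.

```python
import numpy as np


def init_mlp(layer_sizes, seed=0):
    """Create random weights and zero biases for a feed-forward network.

    layer_sizes is a list such as [inputs, hidden, ..., outputs].
    All sizes used below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Scale the weights by 1/sqrt(n_in) to keep activations in a sane range.
        w = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
        b = np.zeros(n_out)
        params.append((w, b))
    return params


def predict_acoustic_features(params, linguistic_features):
    """Map linguistic feature frames to acoustic parameter frames.

    linguistic_features has shape (num_frames, num_linguistic_features).
    Hidden layers use tanh (an assumption); the output layer is linear
    because the targets are real-valued speech parameters.
    """
    h = linguistic_features
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)
    w, b = params[-1]
    return h @ w + b


# Hypothetical dimensions: 300 text-derived inputs, 120 acoustic outputs.
NUM_LINGUISTIC = 300
NUM_ACOUSTIC = 120

params = init_mlp([NUM_LINGUISTIC, 256, 256, NUM_ACOUSTIC])
frames = np.random.default_rng(1).random((5, NUM_LINGUISTIC))
acoustic = predict_acoustic_features(params, frames)
```

In a full system of the kind the abstract describes, the predicted parameters would then be used to reconstruct a speech waveform; that step is outside this sketch.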
