Statistical Parametric Speech Synthesis Based on Speaker and Language Factorization

Norbert Braunschweiler
Sabine Buchholz
Mark J. F. Gales
Kate Knill
Sacha Krstulovic
Javier Latorre
IEEE Transactions on Audio, Speech, and Language Processing, vol. 20 (2012), pp. 1713-1724


An increasingly common scenario in building speech synthesis and recognition systems is training on inhomogeneous data. This paper proposes a new framework for estimating hidden Markov models on data containing both multiple speakers and multiple languages. The proposed framework, speaker and language factorization, attempts to factorize speaker-/language-specific characteristics in the data and then model them using separate transforms. Language-specific factors in the data are represented by transforms based on cluster mean interpolation with cluster-dependent decision trees. Acoustic variations caused by speaker characteristics are handled by transforms based on constrained maximum-likelihood linear regression. Experimental results on statistical parametric speech synthesis show that the proposed framework enables data from multiple speakers in different languages to be used to: train a synthesis system; synthesize speech in a language using speaker characteristics estimated in a different language; and adapt to a new language.
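
The two kinds of transform named in the abstract can be illustrated with a minimal sketch. It assumes that cluster mean interpolation forms a Gaussian mean as a weighted sum of per-cluster mean vectors, with one weight vector per language, and that constrained maximum-likelihood linear regression acts as an affine transform of the feature vector, with one transform per speaker. All function names, dimensions, and numeric values below are hypothetical and chosen for illustration only. The cluster-dependent decision trees and the estimation of the weights and transforms are not modelled.

```python
# Illustrative sketch only: names, shapes, and values are assumptions,
# not taken from the paper.

def interpolate_cluster_means(cluster_means, language_weights):
    """Language factor: mean[d] = sum over clusters p of weight[p] * cluster_means[p][d]."""
    dim = len(cluster_means[0])
    return [
        sum(w * mu[d] for w, mu in zip(language_weights, cluster_means))
        for d in range(dim)
    ]


def apply_cmllr(A, b, observation):
    """Speaker factor: affine feature transform A * o + b (constrained MLLR form)."""
    return [
        sum(a_rd * o_d for a_rd, o_d in zip(row, observation)) + b_r
        for row, b_r in zip(A, b)
    ]


# Toy 2-dimensional example with two mean clusters (hypothetical values).
cluster_means = [[1.0, 0.0], [0.0, 2.0]]
weights_lang_a = [1.0, 0.5]    # hypothetical interpolation weights, language A
weights_lang_b = [1.0, -0.5]   # hypothetical interpolation weights, language B

# The same clusters yield different means under different language weights.
mean_a = interpolate_cluster_means(cluster_means, weights_lang_a)
mean_b = interpolate_cluster_means(cluster_means, weights_lang_b)

# Hypothetical speaker transform, applied independently of the language weights.
A_speaker = [[2.0, 0.0], [0.0, 1.0]]
b_speaker = [0.5, -1.0]
transformed = apply_cmllr(A_speaker, b_speaker, [1.0, 1.0])

print(mean_a, mean_b, transformed)
```

Because the language weights and the speaker transform are separate objects in this sketch, a speaker transform estimated on data in one language can be paired with the weights of another language, which is the kind of combination the abstract describes.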
