Predicting Pronunciations with Syllabification and Stress with Recurrent Neural Networks

Daan van Esch; Mason Chua; Kanishka Rao


Proceedings of Interspeech 2016


Abstract

Word pronunciations, consisting of phoneme sequences and the associated syllabification and stress patterns, are vital for both speech recognition and text-to-speech (TTS) systems. For speech recognition, phoneme sequences for words may be learned from audio data. We train recurrent neural network (RNN) based models to predict the syllabification and stress pattern for such pronunciations, making them usable for TTS. We find that these RNN models significantly outperform naive rule-based models for almost all languages we tested. Further, we find additional improvements to the stress prediction model by using the spelling as features in addition to the phoneme sequence. Finally, we train a single RNN model to predict the phoneme sequence, syllabification, and stress for a given word. For several languages, this single RNN outperforms similar models trained specifically for either phoneme sequence or stress prediction. We report an exhaustive comparison of these approaches for twenty languages.
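To make the prediction task concrete, the sketch below shows one plausible way to frame syllabification and stress as per-phoneme labels that a sequence model could be trained to emit. The abstract does not specify the label encoding, so the tag scheme (`B0`, `B1`, `I`), the function names `encode` and `decode`, and the ARPAbet-style example symbols are all assumptions made for illustration, not the authors' method.

```python
from typing import List, Tuple

# A syllable is (phonemes, stress), where stress is 0 (unstressed) or
# 1 (primary stress). This representation is hypothetical.
Syllable = Tuple[List[str], int]


def encode(syllables: List[Syllable]) -> Tuple[List[str], List[str]]:
    """Flatten syllables into a phoneme sequence plus one tag per phoneme.

    Assumed tag scheme: "B<stress>" marks the first phoneme of a syllable
    and carries that syllable's stress level; "I" marks every other phoneme.
    """
    phonemes: List[str] = []
    tags: List[str] = []
    for phones, stress in syllables:
        for i, phone in enumerate(phones):
            phonemes.append(phone)
            tags.append(f"B{stress}" if i == 0 else "I")
    return phonemes, tags


def decode(phonemes: List[str], tags: List[str]) -> List[Syllable]:
    """Rebuild syllables and stress levels from per-phoneme tags."""
    if len(phonemes) != len(tags):
        raise ValueError("phonemes and tags must have the same length")
    syllables: List[Syllable] = []
    for phone, tag in zip(phonemes, tags):
        if tag.startswith("B"):
            # Start a new syllable with the stress level encoded in the tag.
            syllables.append(([phone], int(tag[1:])))
        elif syllables:
            # Continue the current syllable.
            syllables[-1][0].append(phone)
        else:
            raise ValueError("tag sequence must start with a B tag")
    return syllables


# "banana" with primary stress on the second syllable (illustrative symbols).
banana: List[Syllable] = [(["B", "AH"], 0), (["N", "AE"], 1), (["N", "AH"], 0)]
banana_phonemes, banana_tags = encode(banana)
```

Under this framing, a model that takes the phoneme sequence as input and predicts the tag sequence would recover both syllable boundaries and stress in a single pass, and `decode` turns its output back into a syllabified pronunciation.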
