Building Large-Vocabulary ASR Systems for Languages Without Any Audio Training Data

Manasa Prasad; Daan van Esch; Sandy Ritchie; Jonas Fromseier Mortensen

Proceedings of Interspeech 2019

Abstract

Building an automatic speech recognition (ASR) system typically requires some amount of audio and text data in the target language. While text data can be obtained relatively easily across many languages, transcribed audio data is challenging to obtain. This presents a barrier to making voice technologies available in more of the world's languages. In this paper, we present a way to build an ASR system for a language even in the absence of any audio training data in that language. We do this by simply re-using an existing acoustic model from a phonologically similar language, without any kind of modification or adaptation towards the target language. The basic insight is that, if two languages have sufficiently similar phonological systems, an acoustic model trained on one should hold up relatively well when used for the other. We describe how we tailor our pronunciation models to enable such re-use, and show experimental results across a number of languages from various language families. We also provide a theoretical analysis of situations in which this approach is likely to work. Our results show that it is possible to achieve less than 20% word error rate (WER) using this method.
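As a rough illustration of what tailoring a pronunciation model for acoustic-model re-use could involve, the sketch below rewrites a target-language pronunciation lexicon so that it only uses phonemes the donor language's acoustic model already knows. This is not the paper's implementation: the function names, the phoneme labels, and the mapping table are all hypothetical, and the abstract does not specify how the mapping is actually derived.

```python
from typing import Dict, List, Set

# Hypothetical mapping from target-language phonemes that the donor acoustic
# model lacks to the closest donor phoneme. Labels are illustrative only.
PHONEME_MAP: Dict[str, str] = {
    "t_retroflex": "t",
    "d_retroflex": "d",
    "n_retroflex": "n",
}


def remap_pronunciation(
    phonemes: List[str],
    donor_inventory: Set[str],
    phoneme_map: Dict[str, str],
) -> List[str]:
    """Rewrite one pronunciation using only phonemes in the donor inventory."""
    remapped = []
    for phoneme in phonemes:
        if phoneme in donor_inventory:
            # The donor acoustic model already covers this phoneme.
            remapped.append(phoneme)
        elif phoneme_map.get(phoneme) in donor_inventory:
            # Substitute the closest donor phoneme (assumed mapping).
            remapped.append(phoneme_map[phoneme])
        else:
            raise KeyError(f"no donor phoneme available for {phoneme!r}")
    return remapped


def remap_lexicon(
    lexicon: Dict[str, List[str]],
    donor_inventory: Set[str],
    phoneme_map: Dict[str, str],
) -> Dict[str, List[str]]:
    """Apply remap_pronunciation to every entry of a word-to-phonemes lexicon."""
    return {
        word: remap_pronunciation(phonemes, donor_inventory, phoneme_map)
        for word, phonemes in lexicon.items()
    }


# Toy data: a donor inventory and a two-word target lexicon (both made up).
DONOR_INVENTORY: Set[str] = {"a", "i", "t", "d", "n", "k"}
TARGET_LEXICON: Dict[str, List[str]] = {
    "word_one": ["k", "a", "t_retroflex", "i"],
    "word_two": ["d_retroflex", "a", "n"],
}

REMAPPED_LEXICON = remap_lexicon(TARGET_LEXICON, DONOR_INVENTORY, PHONEME_MAP)
```

The remapped lexicon can then be paired with the unmodified donor acoustic model and a target-language language model. The fewer substitutions the mapping needs, the closer the two phonological systems are, which is the situation in which the abstract expects the approach to hold up.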
