Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model
Multilingual end-to-end (E2E) models have shown great promise as a means to expand coverage of the world’s lan- guages by automatic speech recognition systems. They im- prove over monolingual E2E systems, especially on low re- source languages, and simplify training and serving by elimi- nating language-specific acoustic, pronunciation, and language models. This work aims to develop an E2E multilingual system which is equipped to operate in low-latency interactive applica- tions as well as handle the challenges of real world imbalanced data. First, we present a streaming E2E multilingual model. Second, we compare techniques to deal with imbalance across languages. We find that a combination of conditioning on a language vector and training language-specific adapter layers produces the best model. The resulting E2E multilingual model system achieves lower word error rate (WER) than state-of-the- art conventional monolingual models by at least 10% relative on every language.