Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model

Abstract

Multilingual end-to-end (E2E) models have shown great promise as a means
to expand coverage of the world’s languages by automatic speech
recognition systems. They improve over monolingual E2E systems,
especially on low-resource languages, and simplify training and serving
by eliminating language-specific acoustic, pronunciation, and language
models. This work aims to develop an E2E multilingual system that is
equipped to operate in low-latency interactive applications and to
handle the challenges of real-world imbalanced data. First, we present a
streaming E2E multilingual model. Second, we compare techniques to deal
with imbalance across languages. We find that a combination of
conditioning on a language vector and training language-specific adapter
layers produces the best model. The resulting E2E multilingual model
achieves a lower word error rate (WER) than state-of-the-art
conventional monolingual models by at least 10% relative on every
language.
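
To make the two techniques named above concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the details, so several choices here are assumptions: the language vector is taken to be a one-hot vector concatenated to the input features, each adapter is taken to be a small residual bottleneck layer applied after a shared layer, and the language codes, dimensions, and all function and class names are hypothetical.

```python
import random


def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


class Adapter:
    """Assumed adapter form: x + W_up * relu(W_down * x)."""

    def __init__(self, dim, bottleneck, rng):
        self.w_down = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Zero-initialised up-projection: the adapter starts as the identity,
        # so adding it does not disturb the shared model before training.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = relu(matvec(self.w_down, x))
        update = matvec(self.w_up, hidden)
        return [xi + ui for xi, ui in zip(x, update)]


LANGUAGES = ["hi", "bn", "ta"]  # hypothetical language codes
FEATURE_DIM, HIDDEN_DIM, BOTTLENECK = 4, 5, 2  # hypothetical sizes


def condition_on_language(features, lang):
    """Append the one-hot language vector to the acoustic features."""
    return features + one_hot(LANGUAGES.index(lang), len(LANGUAGES))


rng = random.Random(0)
# One set of weights shared by all languages.
shared_weights = [[rng.uniform(-0.5, 0.5)
                   for _ in range(FEATURE_DIM + len(LANGUAGES))]
                  for _ in range(HIDDEN_DIM)]
# One small adapter per language; only these would be language-specific.
adapters = {lang: Adapter(HIDDEN_DIM, BOTTLENECK, rng) for lang in LANGUAGES}


def encode_frame(features, lang):
    """Shared layer on language-conditioned input, then that language's adapter."""
    conditioned = condition_on_language(features, lang)
    hidden = relu(matvec(shared_weights, conditioned))
    return adapters[lang](hidden)


frame = [0.2, -0.1, 0.4, 0.3]  # made-up acoustic feature frame
output = encode_frame(frame, "ta")
```

In this sketch the bulk of the parameters (`shared_weights`) is shared across languages, while each adapter adds only a small number of per-language parameters, which is one way a model could specialise to a language without duplicating the whole network.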