- Yun Zhu
- Parisa Haghani
- Anshuman Tripathi
- Bhuvana Ramabhadran
- Brian Farris
- Hainan Xu
- Han Lu
- Hasim Sak
- Isabel Leal
- Neeraj Gaur
- Pedro Jose Moreno Mengibar
- Qian Zhang
Abstract
Multilingual automatic speech recognition systems can transcribe utterances from different languages. These systems are attractive from different perspectives: they can provide quality improvements, specially for lower resource languages, and simplify the training and deployment procedure. End-to-end speech recognition has further simplified multilingual modeling as one model, instead of several components of a classical system, have to be unified. In this paper, we investigate a streamable end-to-end multilingual system based on the Transformer Transducer. We propose several techniques for adapting the self-attention architecture based on the language id. We analyze the trade-offs of each method with regards to quality gains and number of additional parameters introduced. We conduct experiments in a real-world task consisting of five languages. Our experimental results demonstrate $\sim$10\% and $\sim$15\% relative gain over the baseline multilingual model.
Research Areas
Learn more about how we do research
We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work