Manasa Prasad
I work on internationalization of Speech and Keyboard, specifically building quality recognition systems with as little data as possible.
Authored Publications
Sort By
Preview abstract
Pronunciation modeling is a key task for building speech technology in new languages, and while solid grapheme-to-phoneme (G2P) mapping systems exist, language coverage can stand to be improved. The information needed to build G2P models for many more languages can easily be found on Wikipedia, but unfortunately, it is stored in disparate formats. We report on a system we built to mine a pronunciation data set in 819 languages from loosely structured tables within Wikipedia. The data includes phoneme inventories, and for 63 low-resource languages, also includes the grapheme-to-phoneme (G2P) mapping. 54 of these languages do not have easily findable G2P mappings online otherwise. We turned the information from Wikipedia into a structured, machine-readable TSV format, and make the resulting data set publicly available so it can be improved further and used in a variety of applications involving low-resource languages.
View details
Self-Adaptive Distillation for Multilingual Speech Recognition: Leveraging Student Independence
Brian Farris
Pedro Jose Moreno Mengibar
Yun Zhu
Interspeech 2021 (to appear)
Preview abstract
With a large population of the world speaking more than one language, multilingual automatic speech recognition (ASR) has gained popularity in the recent years. While lower resource languages can benefit from quality improvements in a multilingual ASR system, including unrelated or higher resource languages in the mix often results in performance degradation. In this paper, we propose distilling from multiple teachers, with each language using its best teacher during training, to tackle this problem. We introduce self-adaptive distillation, a novel technique for automatic weighting of the distillation loss that uses the student/teachers confidences. We analyze the effectiveness of the proposed techniques on two real world use-cases and show that the performance of the multilingual ASR models can be improved by up to 11.5% without any increase in model capacity. Furthermore, we show that when our methods are combined with increase in model capacity, we can achieve quality gains of up to 20.7%.
View details
Mixture of Informed Experts for Multilingual Speech Recognition
Brian Farris
Pedro Jose Moreno Mengibar
Yun Zhu
ICASSP 2021, IEEE International Conference on Acoustics, Speech and Signal Processing (to appear)
Preview abstract
Multilingual speech recognition models are capable of recognizing speech in multiple different languages. When trained on related or low-resource languages, these models often outperform their monolingual counterparts. Similar to other forms of multi-task models, when the group of languages are unrelated, or when large amounts of training data is available, multilingual models can suffer from performance loss. We investigate the use of a mixture-of-expert approach to assign per-language parameters in the model to increase network capacity in a structured fashion. We introduce a novel variant of this approach, 'informed experts', which attempts to tackle inter-task conflicts by eliminating gradients from other tasks in the these task-specific parameters. We conduct experiments on a real-world task on English, French and four dialects of Arabic to show the effectiveness of our approach.
View details
Writing Across the World's Languages: Deep Internationalization for Gboard, the Google Keyboard
Elnaz Sarbar
Jeremy O'Brien
Evan Elizabeth Crew
Chieu Nguyen
arXiv cs.HC (2019)
Preview abstract
This technical report describes our deep internationalization program for Gboard, the Google Keyboard. Today, Gboard supports 900+ language varieties across 70+ writing systems, and this report describes how and why we added support for these language varieties from around the globe. Many languages of the world are increasingly used in writing on an everyday basis, and we describe the trends we see. We cover technological and logistical challenges in scaling up a language technology product like Gboard to hundreds of language varieties, and describe how we built systems and processes to operate at scale. Finally, we summarize the key take-aways from user studies we ran with speakers of hundreds of languages from around the world.
View details
Preview abstract
When building automatic speech recognition (ASR) systems, typically some amount of audio and text data in the target language is needed. While text data can be obtained relatively easily across many languages, transcribed audio data is challenging to obtain. This presents a barrier to making voice technologies available in more languages of the world. In this paper, we present a way to build an ASR system for a language even in the absence of any audio training data in that language at all. We do this by simply re-using an existing acoustic model from a phonologically similar language, without any kind of modification or adaptation towards the target language. The basic insight is that, if two languages are sufficiently similar in terms of their phonological system, an acoustic model should hold up relatively well when used for another language. We describe how we tailor our pronunciation models to enable such re-use, and show experimental results across a number of languages from various language families. We also provide a theoretical analysis of situations in which this approach is likely to work. Our results show that is possible to achieve less than 20% word error rate (WER) using this method.
View details
Mining Training Data for Language Modeling across the World’s Languages
Proceedings of the 6th International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU 2018)
Preview abstract
Building smart keyboards and speech recognition systems for new languages requires a large, clean text corpus to train n-gram language models on. We report our findings on how much text data can realistically be found on the web across thousands of languages. In addition, we describe an innovative, scalable approach to normalizing this data: all data sources are noisy to some extent, but this situation is even more severe for low-resource languages. To help clean the data we find across all languages in a scalable way, we built a pipeline to automatically derive the configuration for language-specific text normalization systems, which we describe here as well.
View details