Jump to Content

Arindrima Datta

Research Areas

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Preview abstract Automated speech recognition (ASR) coverage of the world's languages continues to expand. Yet, as data-demanding neural network models continue to revolutionize the field, it poses a challenge for data-scarce languages. Multilingual models allow for the joint training of data-scarce and data-rich languages enabling data and parameter sharing. One of the main goals of multilingual ASR is to build a single model for all languages while reaping the benefits of sharing on data-scarce languages without impacting performance on the data-rich languages. However, most state-of-the-art multilingual models require the encoding of language information and therefore are not as flexible or scalable when expanding to newer languages. Language independent multilingual models help to address this, as well as, are more suited to multicultural societies such as in India, where languages overlap and are frequently used together by native speakers. In this paper, we propose a new approach to building a language-agnostic multilingual ASR system using transliteration. This training strategy maps all languages to one writing system through a many-to-one transliteration transducer that maps similar sounding acoustics to one target sequences such as, graphemes, phonemes or wordpieces resulting in improved data sharing and reduced phonetic confusions. We propose a training strategy that maps all languages to one writing system through a many-to-one transliteration transducer. We show with four Indic languages, namely, Hindi, Bengali, Tamil and Kannada, that the resulting multilingual model achieves a performance comparable to a language-dependent multilingual model, with an improvement of up to 15\% relative on the data-scarce language. View details
    Preview abstract Multilingual end-to-end (E2E) models have shown great promise as a means to expand coverage of the world’s lan- guages by automatic speech recognition systems. They im- prove over monolingual E2E systems, especially on low re- source languages, and simplify training and serving by elimi- nating language-specific acoustic, pronunciation, and language models. This work aims to develop an E2E multilingual system which is equipped to operate in low-latency interactive applica- tions as well as handle the challenges of real world imbalanced data. First, we present a streaming E2E multilingual model. Second, we compare techniques to deal with imbalance across languages. We find that a combination of conditioning on a language vector and training language-specific adapter layers produces the best model. The resulting E2E multilingual model system achieves lower word error rate (WER) than state-of-the- art conventional monolingual models by at least 10% relative on every language. View details
    No Results Found