Mohammadreza Ghodsi

I am a research scientist and software engineer working on Language Models (LMs) that are used in Automated Speech Recognition. I focus on Contextual LMs, which use non-linguistic signals (such as geographical location and time of day) to enhance traditional LMs.
Authored Publications
    Semi- and self-supervised training techniques have the potential to improve performance of speech recognition systems without additional transcribed speech data. In this work, we demonstrate the efficacy of two approaches to semi-supervision for automated speech recognition. The two approaches leverage vast amounts of available unspoken text and untranscribed audio. First, we present factorized multilingual speech synthesis to improve data augmentation on unspoken text. Next, we present an online implementation of Noisy Student Training to incorporate untranscribed audio. We propose a modified Sequential MixMatch algorithm with iterative learning to learn from untranscribed speech. We demonstrate the compatibility of these techniques yielding a relative reduction of word error rate of up to 14.4% on the voice search task.
    RNN-Transducer with stateless prediction network
    James Apfel
    Rodrigo Cabrera
    Xiaofeng Liu
    ICASSP 2020, IEEE, pp. 7049-7053
    The RNN-Transducer (RNNT) outperforms classic Automatic Speech Recognition (ASR) systems when a large amount of supervised training data is available. For low-resource languages, the RNNT models overfit, and cannot directly take advantage of additional large text corpora as in classic ASR systems. We focus on the prediction network of the RNNT, since it is believed to be analogous to the Language Model (LM) in the classic ASR systems. We pre-train the prediction network with text-only data, which is not helpful. Moreover, removing the recurrent layers from the prediction network, which makes the prediction network stateless, performs virtually as well as the original RNNT model, when using wordpieces. The stateless prediction network does not depend on the previous output symbols, except the last one. Therefore it simplifies the RNNT architectures and the inference. Our results suggest that the RNNT prediction network does not function as the LM in classical ASR. Instead, it merely helps the model align to the input audio, while the RNNT encoder and joint networks capture both the acoustic and the linguistic information.
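The abstract above says that the stateless prediction network depends only on the last output symbol. A minimal sketch of what that implies is given below, assuming a plain embedding-table lookup; the class name, dimensions, and random initialisation are hypothetical and are not taken from the paper.

```python
import random


class StatelessPredictionNetwork:
    """Toy prediction network with no recurrent state (hypothetical names).

    The output depends only on the most recent output symbol, so the
    network reduces to an embedding-table lookup.
    """

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        # One randomly initialised vector per wordpiece id; a real model
        # would learn these values during training.
        self.embeddings = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(vocab_size)
        ]

    def __call__(self, previous_symbols):
        # Only the last emitted symbol is used; earlier history is ignored.
        return self.embeddings[previous_symbols[-1]]


net = StatelessPredictionNetwork(vocab_size=10, dim=4)
# Two different histories that end in the same symbol give the same output.
same = net([1, 2, 3]) == net([7, 3])
```

Because no hidden state is carried between steps, an output can be cached per symbol during decoding, which is one way to read the abstract's remark that inference is simplified.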
    Maximum Entropy (MaxEnt) Language Models (LMs) are powerful models that can incorporate linguistic and non-linguistic contextual signals in a unified framework, by optimizing a convex loss function. In addition to their flexibility, a key advantage is their scalability, in terms of model size and the amount of data that can be used during training. We present the following two contributions to MaxEnt training: (1) By leveraging smaller amounts of transcribed data, we demonstrate that a MaxEnt LM trained on various types of corpora can be easily adapted to better match the test distribution of speech recognition; (2) A novel adaptive-training approach that efficiently models multiple types of non-linguistic features in a universal model. We test the impact of these approaches on Google's state-of-the-art speech recognizer for the task of voice-search transcription and dictation. Training 10B parameter models utilizing a corpus of up to 1T words, we show large reductions in word error rate from adaptation across multiple languages. Also, human evaluations show strong, significant improvements on a wide range of domains from using non-linguistic signals. For example, adapting to geographical domains (e.g., US states and cities) affects about 4% of test utterances, with a 2:1 win-to-loss ratio.
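To illustrate how a MaxEnt LM can combine linguistic and non-linguistic signals in one model, here is a minimal sketch of a log-linear next-word distribution. The feature templates (unigram, bigram, and a geographic feature), the weights, and the function name are assumptions made for illustration; they are not the features or parameters used in the paper.

```python
import math


def maxent_probs(weights, vocab, history, context):
    """Return P(word | history, context) under a toy log-linear model."""
    scores = {}
    for word in vocab:
        # Hypothetical binary features that fire for this candidate word.
        features = [
            ("unigram", word),                 # linguistic
            ("bigram", history[-1], word),     # linguistic
            ("geo", context["state"], word),   # non-linguistic
        ]
        # The score is the sum of the weights of the active features.
        scores[word] = sum(weights.get(f, 0.0) for f in features)
    # Normalise with a softmax over the vocabulary.
    z = sum(math.exp(s) for s in scores.values())
    return {word: math.exp(s) / z for word, s in scores.items()}


# Hypothetical weights: the geographic feature favours "fresno" in California.
weights = {("geo", "CA", "fresno"): 1.5, ("unigram", "boston"): 0.5}
probs = maxent_probs(weights, ["fresno", "boston"], ["near"], {"state": "CA"})
```

With these weights, changing `context["state"]` to a value with no matching feature removes the boost for "fresno", which shows how a single model can adapt its predictions to a non-linguistic signal.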
    It has been shown in the literature that automatic speech recognition systems can greatly benefit from contextual information [ref]. The contextual information can be used to simplify the search and improve recognition accuracy. The types of useful contextual information can include the name of the application the user is in, the contents on the user’s phone screen, the user’s location, a certain dialog state, etc. Building a separate language model for each of these types of context is not feasible due to limited resources or a limited amount of training data. In this paper we describe an approach for unsupervised learning of contextual information and automatic building of contextual (biasing) models. Our approach can be used to build a large number of small contextual models from a limited amount of available unsupervised training data. We describe how n-grams relevant for a particular context are automatically selected, as well as how the optimal size of the final contextual model is chosen. Our experimental results show great accuracy improvements for several types of context.