Mohammadreza Ghodsi
I am a research scientist and software engineer working on Language Models (LMs) used in Automatic Speech Recognition (ASR). I focus on Contextual LMs, which use non-linguistic signals (such as geographical location, time of day, etc.) to enhance traditional LMs.
Authored Publications
Semi-Supervision in ASR: Sequential Mixmatch and Factorized TTS-Based Augmentation
Zhehuai Chen
Yu Zhang
Yinghui Huang
Jesse Emond
Pedro Jose Moreno Mengibar
(2021)
Abstract
Semi- and self-supervised training techniques have the potential to improve the performance of speech recognition systems without additional transcribed speech data. In this work, we demonstrate the efficacy of two approaches to semi-supervision for automatic speech recognition. The two approaches leverage vast amounts of available unspoken text and untranscribed audio. First, we present factorized multilingual speech synthesis to improve data augmentation on unspoken text. Next, we present an online implementation of Noisy Student Training to incorporate untranscribed audio. We propose a modified Sequential MixMatch algorithm with iterative learning to learn from untranscribed speech. We demonstrate the compatibility of these techniques, yielding a relative word error rate reduction of up to 14.4% on the voice search task.
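As a rough illustration of the online Noisy Student idea described above, the sketch below shows a teacher/student loop over untranscribed audio in Python. The `Utterance` type, the `transcribe` and `train` helpers, and the confidence threshold are placeholders introduced here for illustration only, not the paper's implementation.

```python
# Hypothetical sketch of an online Noisy Student Training loop for ASR.
# All helper names and thresholds are illustrative stand-ins.

import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Utterance:
    audio: List[float]        # stand-in for acoustic features
    text: Optional[str]       # None for untranscribed audio

def transcribe(model, utt):
    """Stand-in teacher decoder: returns (hypothesis, confidence)."""
    return "placeholder hypothesis", random.random()

def train(labeled):
    """Stand-in training step: returns a new 'model' (here, just a summary)."""
    return {"num_examples": len(labeled)}

def noisy_student(supervised, untranscribed, generations=3, conf_threshold=0.9):
    model = train(supervised)
    for _ in range(generations):
        # Teacher labels the untranscribed pool; keep only confident hypotheses.
        pseudo = []
        for utt in untranscribed:
            hyp, conf = transcribe(model, utt)
            if conf >= conf_threshold:
                pseudo.append(Utterance(utt.audio, hyp))
        # Student retrains on supervised + filtered pseudo-labeled data
        # (augmentation such as TTS-generated speech would be mixed in here).
        model = train(supervised + pseudo)
    return model

labeled = [Utterance(audio=[0.1, 0.2], text="hello world")]
unlabeled = [Utterance(audio=[0.3, 0.4], text=None)]
print(noisy_student(labeled, unlabeled))
```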
RNN-Transducer with stateless prediction network
James Apfel
Rodrigo Cabrera
Xiaofeng Liu
ICASSP 2020, IEEE, pp. 7049-7053
Abstract
The RNN-Transducer (RNNT) outperforms classic Automatic Speech Recognition (ASR) systems when a large amount of supervised training data is available. For low-resource languages, however, RNNT models overfit and cannot directly take advantage of additional large text corpora as classic ASR systems can. We focus on the prediction network of the RNNT, since it is believed to be analogous to the Language Model (LM) in classic ASR systems. We find that pre-training the prediction network with text-only data does not help. Moreover, removing the recurrent layers from the prediction network, which makes it stateless, performs virtually as well as the original RNNT model when using wordpieces. The stateless prediction network does not depend on the previous output symbols, except the last one; it therefore simplifies the RNNT architecture and inference. Our results suggest that the RNNT prediction network does not function as the LM in classical ASR. Instead, it merely helps the model align to the input audio, while the RNNT encoder and joint networks capture both the acoustic and the linguistic information.
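A minimal PyTorch-style sketch of the contrast described above: the stateless prediction network conditions only on the single previous label, so an embedding lookup replaces the LSTM. The class names and layer sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class StatelessPredictionNetwork(nn.Module):
    """Prediction network that depends only on the most recent output label."""
    def __init__(self, vocab_size: int, embed_dim: int = 640):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)

    def forward(self, prev_labels: torch.Tensor) -> torch.Tensor:
        # prev_labels: [batch, 1] -- only the last label is used, so there is
        # no recurrent state to carry between decoding steps.
        return self.embedding(prev_labels)

class LSTMPredictionNetwork(nn.Module):
    """Conventional RNNT prediction network, shown for contrast."""
    def __init__(self, vocab_size: int, embed_dim: int = 640, hidden: int = 640):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)

    def forward(self, prev_labels: torch.Tensor, state=None):
        out, state = self.lstm(self.embedding(prev_labels), state)
        return out, state
```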
Abstract
Maximum Entropy (MaxEnt) Language Models (LMs) are powerful models that can incorporate linguistic and non-linguistic contextual signals in a unified framework by optimizing a convex loss function. In addition to their flexibility, a key advantage is their scalability, in terms of model size and the amount of data that can be used during training. We present the following two contributions to MaxEnt training: (1) by leveraging smaller amounts of transcribed data, we demonstrate that a MaxEnt LM trained on various types of corpora can be easily adapted to better match the test distribution of speech recognition; (2) we introduce a novel adaptive-training approach that efficiently models multiple types of non-linguistic features in a universal model. We test the impact of these approaches on Google's state-of-the-art speech recognizer for the task of voice-search transcription and dictation. Training 10B-parameter models on a corpus of up to 1T words, we show large reductions in word error rate from adaptation across multiple languages. Human evaluations also show significant improvements on a wide range of domains from using non-linguistic signals. For example, adapting to geographical domains (e.g., US states and cities) affects about 4% of test utterances, with a 2:1 win-to-loss ratio.
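To make the "unified framework" concrete, here is a toy log-linear (MaxEnt) scoring sketch in Python that crosses an n-gram feature with a non-linguistic (geographic) signal. The feature names, weights, and tiny vocabulary are illustrative only, not the model described in the paper.

```python
import math
from collections import defaultdict

def features(word, prev_word, geo):
    """Features fired when predicting `word` from linguistic + contextual signals."""
    return [
        ("unigram", word),
        ("bigram", prev_word, word),
        ("geo*unigram", geo, word),   # non-linguistic signal crossed with the word
    ]

def log_prob(word, prev_word, geo, weights, vocab):
    """p(word | context) under a log-linear (MaxEnt) model."""
    def score(w):
        return sum(weights[f] for f in features(w, prev_word, geo))
    log_z = math.log(sum(math.exp(score(w)) for w in vocab))
    return score(word) - log_z

weights = defaultdict(float)
weights[("geo*unigram", "NY", "broadway")] = 2.0   # learned by minimizing a convex loss
vocab = ["broadway", "the", "weather"]
print(log_prob("broadway", "on", "NY", weights, vocab))
```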
Unsupervised Context Learning For Speech Recognition
Justin Scheiner
Spoken Language Technology (SLT) Workshop, IEEE (2016)
Abstract
It has been shown in the literature that automatic speech recognition systems can greatly benefit from contextual information [ref]. The contextual information can be used to simplify the search and improve recognition accuracy. The types of useful contextual information include the name of the application the user is in, the contents of the user's phone screen, the user's location, a certain dialog state, etc. Building a separate language model for each of these types of context is not feasible due to limited resources or a limited amount of training data.

In this paper we describe an approach for unsupervised learning of contextual information and automatic building of contextual (biasing) models. Our approach can be used to build a large number of small contextual models from a limited amount of available unsupervised training data. We describe how n-grams relevant to a particular context are automatically selected, as well as how the optimal size of the final contextual model is chosen. Our experimental results show substantial accuracy improvements for several types of context.
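One plausible (hypothetical) way to select context-relevant n-grams, in the spirit of the approach above, is to keep n-grams whose relative frequency in in-context data far exceeds their background frequency. The thresholds and scoring rule below are illustrative assumptions, not the paper's exact recipe.

```python
from collections import Counter

def ngrams(tokens, n=2):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def select_biasing_ngrams(context_texts, background_texts, n=2, min_ratio=3.0):
    ctx = Counter(g for t in context_texts for g in ngrams(t.split(), n))
    bg = Counter(g for t in background_texts for g in ngrams(t.split(), n))
    ctx_total = sum(ctx.values()) or 1
    bg_total = sum(bg.values()) or 1
    selected = {}
    for gram, count in ctx.items():
        p_ctx = count / ctx_total
        p_bg = (bg[gram] + 1) / (bg_total + len(bg))  # add-one smoothing
        if p_ctx / p_bg >= min_ratio:
            selected[gram] = p_ctx / p_bg  # score used to weight the biasing model
    return selected

# Example: n-grams typical of a "navigation" context vs. general queries.
context = ["directions to main street", "navigate to main street"]
background = ["what is the weather", "play some music", "directions to nowhere"]
print(select_biasing_ngrams(context, background))
```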
Bringing Contextual Information to Google Speech Recognition
Keith Hall
David Rybach
Pedro Moreno
Interspeech 2015, International Speech Communication Association