Alexander Gruenstein

Alex Gruenstein works on mobile speech interfaces at Google. He holds a Ph.D. in Computer Science from MIT, as well as B.S. and M.S. degrees in Symbolic Systems from Stanford University.

Authored Publications
    On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER), and latency, measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with. Thus E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model, we explore using a Hybrid Autoregressive Transducer (HAT) factorization to better integrate an on-device neural LM trained on text-only data. To further improve decoder latency, we also introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, and is 318X smaller.
    Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model. For example, for the same latency, RNN-T+LAS obtains an 8% relative improvement in WER, while being more than 400 times smaller in model size.
    We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system. Delivering such a model presents numerous challenges: It should improve the performance when the input signal consists of overlapped speech, and must not hurt the speech recognition performance under all other acoustic conditions. In addition, the model must be tiny, fast, and perform inference in a streaming fashion, in order to have minimal impact on CPU, memory, battery and latency. We propose novel techniques to meet these multi-faceted requirements, including using a new asymmetric loss, and adopting adaptive runtime suppression strength. We also show that such a model can be quantized as an 8-bit integer model and run in real time.
    This paper presents a novel dual-microphone speech enhancement algorithm to improve noise robustness of hotword (wake-word) detection as a special application of keyword spotting. It exploits two unique properties of hotwords: they are the leading phrases of valid voice queries that we intend to respond to, and they have short durations. Consequently, an STFT-based adaptive noise cancellation method, modified to use deferred filter coefficients, is proposed to extract hotwords from noisy stereo microphone signals. The new algorithm is tested with two considerably different neural hotword detectors. Both systems show a significantly reduced false-reject rate when the background contains strong TV noise.
    Multi-Microphone Adaptive Noise Cancellation for Robust Hotword Detection
    Yiteng Huang
    Turaj Zakizadeh Shabestary
    Li Wan
    Proc. InterSpeech 2019, pp. 1233-1237
    Recently we proposed a dual-microphone adaptive noise cancellation (ANC) algorithm with deferred filter coefficients for robust hotword detection in [1]. It exploits two unique hotword-related features: hotwords are the leading phrase of valid voice queries and they are short. These features allow us to avoid computing a speech-noise mask, which is a common prerequisite for many multichannel speech enhancement approaches. This novel idea was found effective against strong and ambiguous speech-like TV noise. In this paper, we show that it can be generalized to support more than two microphones. The development is validated using re-recorded data with background TV noise from a 3-mic array. By adding one more microphone, the false reject (FR) rate can be further reduced by 33.5% relative.
    End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.
    A Cascade Architecture for Keyword Spotting on Mobile Devices
    Raziel Alvarez
    Chris Thornton
    Mohammadali Ghodrat
    31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA (2017)
    We present a cascade architecture for keyword spotting with speaker verification on mobile devices. By pairing a small computational footprint with specialized digital signal processing (DSP) chips, we are able to achieve low power consumption while continuously listening for a keyword.
    We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time.
    Unsupervised Testing Strategies for ASR
    Brian Strope
    Doug Beeferman
    Xin Lei
    Interspeech 2011, pp. 1685-1688
    Google offers several speech features on the Android mobile operating system: search by voice, voice input to any text field, and an API for application developers. As a result, our speech recognition service must support a wide range of usage scenarios and speaking styles: relatively short search queries, addresses, business names, dictated SMS and e-mail messages, and a long tail of spoken input to any of the applications users may install. We present a method of on-demand language model interpolation in which contextual information about each utterance determines interpolation weights among a number of n-gram language models. On-demand interpolation results in an 11.2% relative reduction in WER compared to using a single language model to handle all traffic.
    A Self-Labeling Speech Corpus: Collecting Spoken Words with an Online Educational Game
    Andrew Sutherland
    Interspeech (2009)
    City Browser: Developing a Conversational Automotive HMI
    Jarrod Orszulak
    Sean Liu
    Shannon Roberts
    Jeff Zabel
    Bryan Reimer
    Bruce Mehler
    Stephanie Seneff
    James Glass
    Joseph Coughlin
    Proc. of CHI (2009)
    A Self-Transcribing Speech Corpus: Collecting Continuous Speech with an Online Educational Game
    Andrew Sutherland
    SLaTE (2009)
    The WAMI Toolkit for Developing, Deploying, and Evaluating Web-Accessible Multimodal Interfaces
    Ibrahim Badr
    Proc. of 10th International Conference on Multimodal Interfaces (2008)
    Response-Based Confidence Annotation for Spoken Dialogue Systems
    Proc. of the 9th SIGdial Workshop on Discourse and Dialogue (2008)
    Meeting Structure Annotation
    John Niekrasz
    Matthew Purver
    Recent Trends in Discourse and Dialogue, Springer (2008)
    A Multimodal Home Entertainment Interface via a Mobile Device
    Bo-June (Paul) Hsu
    James Glass
    Stephanie Seneff
    Lee Hetherington
    Scott Cyphers
    Ibrahim Badr
    Chao Wang
    Sean Liu
    Proc. of the ACL Workshop on Mobile Language Processing (2008)
    Releasing a Multimodal Dialogue System into the Wild: User Support Mechanisms
    Stephanie Seneff
    Proc. of the 8th SIGdial Workshop on Discourse and Dialogue (2007)
    Scalable and Portable Web-based Multimodal Dialogue Interaction with Geographical Databases
    Stephanie Seneff
    Chao Wang
    Interspeech (2006)
    Context Sensitive Language Modeling for Large Sets of Proper Nouns in Multimodal Dialogue Systems
    Stephanie Seneff
    Proc. of IEEE/ACL Workshop on Spoken Language Technology (2006)
    NOMOS: A Semantic Web Software Framework for Annotation of Multimodal Corpora
    John Niekrasz
    Proc. of the 5th Conference on Language Resources and Evaluation (LREC 2006)
    Context-Sensitive Statistical Language Modeling
    Chao Wang
    Stephanie Seneff
    Interspeech (2005), pp. 17-20
    A General Purpose Architecture for Intelligent Tutoring Systems
    Brady Clark
    Oliver Lemon
    Elizabeth Owen Bratt
    John Fry
    Stanley Peters
    Heather Pon-Barry
    Karl Schultz
    Zack Thomsen-Gray
    Pucktada Treeratpituk
    Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems, Kluwer (2005)
    Meeting Structure Annotation: Data and Tools
    John Niekrasz
    Matthew Purver
    Proc. of the 6th SIGdial Workshop on Discourse and Dialogue (2005)
    Managing uncertainty in dialogue information state for real time understanding of multi-human meeting dialogues
    Lawrence Cavedon
    John Niekrasz
    Dominic Widdows
    Stanley Peters
    Proceedings of the 8th Workshop on Formal Semantics and Pragmatics of Dialogue (Catalog) (2004)
    Demo: A Multimodal Learning Interface for Sketch, Speak and Point creation of a Schedule Chart
    Ed Kaiser
    David Demirdjian
    Xiaoguang Li
    John Niekrasz
    Matt Wesson
    Sanjeev Kumar
    Proceedings of the Sixth International Conference on Multimodal Interfaces (ICMI 2004)
    Emotional Information Available from Videotapes vs Transcripts
    Anna Liess
    Wendy Ellis
    Janine Giese-Davis
    Mitch Golant
    David Spiegel
    Proceedings of the 25th Annual Meeting of the Society of Behavioral Medicine (2004)
    Using an Activity Model to Address Issues in Task-Oriented Dialogue Interaction Over Extended Periods
    Lawrence Cavedon
    Proceedings of AAAI Spring Symposium on Interaction Between Humans and Autonomous Systems over Extended Periods (2004)
    Multi-Human Dialogue Understanding for Assisting Artifact-Producing Meetings
    John Niekrasz
    Lawrence Cavedon
    Proceedings of the 20th International Conference on Computational Linguistics (COLING) (2004)
    Multithreaded context for robust conversational interfaces: context-sensitive speech recognition and interpretation of corrective fragments
    Oliver Lemon
    ACM Transactions on Computer-Human Interaction, vol. 11(3) (2004), pp. 241-267
    Generation of collaborative spoken dialogue contributions in dynamic task environment
    Oliver Lemon
    Randolph Gullett
    Alexis Battle
    Laura Hiatt
    Stanley Peters
    Working Papers of the 2003 AAAI Spring Symposium on Natural Language Generation in Spoken and Written Dialogue, AAAI Press, pp. 85-90
    An information state approach in a multi-modal dialogue system for human-robot conversation
    Oliver Lemon
    Anne Bracy
    Stanley Peters
    Perspectives on Dialogue in the New Millennium, John Benjamins (2003), pp. 229-242
    Targeted Help for Spoken Dialogue Systems: Intelligent Feedback Improves Naive User's Performance
    Beth Ann Hockey
    Oliver Lemon
    Ellen Campana
    Laura Hiatt
    Gregory Aist
    James Hieronymus
    John Dowding
    Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL) (2003)
    Collaborative Dialogue for Controlling Autonomous Systems
    Oliver Lemon
    Lawrence Cavedon
    Stanley Peters
    Proceedings of the AAAI Fall Symposium (2002)
    Multi-tasking and Collaborative Activities in Dialogue Systems
    Oliver Lemon
    Alexis Battle
    Stanley Peters
    Proceedings of the 3rd SIGdial Workshop on Discourse and Dialogue (2002), pp. 113-124
    Collaborative Activities and Multi-tasking in Dialogue Systems
    Oliver Lemon
    Stanley Peters
    Traitement automatique des langues, vol. 43(2) (2002), pp. 131-154
    Information States in a Multi-modal Dialogue System for Human-Robot Conversation
    Oliver Lemon
    Anne Bracy
    Stanley Peters
    Proceedings of the 5th Workshop on Formal Semantics and Pragmatics of Dialogue (Bi-Dialog 2001), pp. 57-67
    A Multi-Modal Dialogue System for Human-Robot Conversation
    Oliver Lemon
    Anne Bracy
    Stanley Peters
    Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL) (2001)
    The WITAS Multi-Modal Dialogue System I
    Oliver Lemon
    Anne Bracy
    Stanley Peters
    7th European Conference on Speech Communication and Technology (EuroSpeech) (2001)