Alexander Gruenstein
Alex Gruenstein works on mobile speech interfaces at Google. He holds a Ph.D. in Computer Science from MIT, as well as B.S. and M.S. degrees in Symbolic Systems from Stanford University.
Authored Publications
Google Publications
An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling
Rami Botros
Ruoming Pang
James Qin
Quoc-Nam Le-The
Anmol Gulati
Chung-Cheng Chiu
Emmanuel Guzman
Jiahui Yu
Qiao Liang
Wei Li
Yu Zhang
Interspeech (2021) (to appear)
Abstract
On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER), and latency, measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with, so E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model, we explore using a Hybrid Autoregressive Transducer (HAT) factorization to better integrate an on-device neural LM trained on text-only data. Furthermore, to reduce decoder latency, we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, while being 318x smaller.
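As a rough illustration of the HAT-style integration described above, the sketch below combines an E2E model's score with an external LM while subtracting an estimate of the model's internal LM. The function name, weights, and probabilities are hypothetical, not taken from the paper.

```python
import math

# Hedged sketch of HAT-style fusion: the E2E score is combined with an
# external neural LM after subtracting an estimate of the model's internal
# LM. All names and weights here are illustrative, not the paper's.

def combined_score(log_p_e2e, log_p_ilm, log_p_ext_lm,
                   ilm_weight=0.1, lm_weight=0.3):
    """Log-domain score for one candidate label.

    log_p_e2e:    log-posterior from the end-to-end model
    log_p_ilm:    log-probability under the estimated internal LM
    log_p_ext_lm: log-probability under the external text-only LM
    """
    return log_p_e2e - ilm_weight * log_p_ilm + lm_weight * log_p_ext_lm

# Example with made-up probabilities for a single rare-word token.
print(combined_score(math.log(0.02), math.log(0.001), math.log(0.05)))
```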
A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency
Ruoming Pang
Antoine Bruguier
Wei Li
Raziel Alvarez
Chung-Cheng Chiu
David Garcia
Kevin Hu
Minho Jin
Qiao Liang
(June) Yuan Shangguan
Yash Sheth
Mirkó Visontai
Yu Zhang
Ding Zhao
ICASSP (2020)
Abstract
Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpass a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff than a conventional model. For example, at the same latency, RNN-T+LAS obtains an 8% relative improvement in WER while being more than 400 times smaller in model size.
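The two-pass idea can be sketched as follows: the RNN-T first pass produces an n-best list, and the LAS rescorer's scores are interpolated with the first-pass scores to pick the winner. The helper, hypotheses, scores, and interpolation weight below are illustrative placeholders, not outputs of the actual models.

```python
# Hypothetical two-pass combination: first-pass RNN-T scores over an n-best
# list are interpolated with second-pass LAS rescorer scores.

def rescore_nbest(nbest, las_score_fn, first_pass_weight=0.5):
    """nbest: list of (hypothesis, rnnt_log_score) pairs."""
    best_hyp, best_score = None, float("-inf")
    for hyp, rnnt_score in nbest:
        las_score = las_score_fn(hyp)  # second-pass attention rescorer
        total = (first_pass_weight * rnnt_score
                 + (1.0 - first_pass_weight) * las_score)
        if total > best_score:
            best_hyp, best_score = hyp, total
    return best_hyp

# Toy usage: the rescorer fixes a first-pass near-miss.
nbest = [("play sum music", -3.1), ("play some music", -3.2)]
las_scores = {"play some music": -1.0, "play sum music": -4.0}
print(rescore_nbest(nbest, las_scores.get))  # play some music
```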
VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
Mert Saglam
Alan Chiao
Renjie Liu
Wei Li
Jason Pelecanos
Marily Nika
Interspeech (2020) (to appear)
Abstract
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on-device to preserve only the speech signals from a target user, as part of a streaming speech recognition system. Delivering such a model presents numerous challenges: it should improve performance when the input signal consists of overlapped speech, and it must not hurt speech recognition performance under all other acoustic conditions. The model must also be tiny and fast, and perform inference in a streaming fashion, in order to have minimal impact on CPU, memory, battery, and latency. We propose novel techniques to meet these multi-faceted requirements, including a new asymmetric loss and adaptive runtime suppression strength. We also show that such a model can be quantized as an 8-bit integer model and run in real time.
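A minimal sketch of an asymmetric spectrogram loss in the spirit described above: errors where the model over-suppresses the target speech are penalized more heavily, by a factor alpha > 1, than errors where noise is left in. The exact formulation and alpha in the paper may differ; treat this as illustrative.

```python
import numpy as np

# Sketch of an asymmetric L2 loss on magnitude spectrograms: removing
# target-speech energy (over-suppression) is weighted by alpha > 1.

def asymmetric_l2_loss(clean_mag, enhanced_mag, alpha=10.0):
    diff = clean_mag - enhanced_mag
    # diff > 0: energy that should have been kept was removed.
    weighted = np.where(diff > 0.0, alpha * diff, diff)
    return float(np.mean(weighted ** 2))

clean = np.abs(np.random.randn(100, 257))     # |STFT| frames x freq bins
print(asymmetric_l2_loss(clean, 0.8 * clean))  # over-suppressed: large loss
print(asymmetric_l2_loss(clean, 1.2 * clean))  # under-suppressed: smaller
```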
Multi-Microphone Adaptive Noise Cancellation for Robust Hotword Detection
Yiteng Huang
Turaj Zakizadeh Shabestary
Li Wan
Proc. InterSpeech 2019, pp. 1233-1237
Abstract
Recently we proposed a dual-microphone adaptive noise cancellation (ANC) algorithm with deferred filter coefficients for robust hotword detection in [1]. It exploits two unique hotword-related features: hotwords are the leading phrase of valid voice queries, and they are short. These features allow us to avoid computing a speech-noise mask, a common prerequisite for many multichannel speech enhancement approaches. This novel idea was found effective against strong and ambiguous speech-like TV noise. In this paper, we show that it can be generalized to support more than two microphones. The development is validated using re-recorded data with background TV noise from a 3-mic array. By adding one more microphone, the false reject (FR) rate is further reduced by 33.5% relative.
Streaming End-to-End Speech Recognition for Mobile Devices
Raziel Alvarez
Ding Zhao
Ruoming Pang
Qiao Liang
Deepti Bhatia
Yuan Shangguan
ICASSP (2019)
Abstract
End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.
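A minimal sketch of the streaming decoding loop a transducer model enables: frames are consumed as they arrive, and at each frame labels are emitted until the model predicts "blank". The `toy_joint` stub below stands in for the real prediction and joint networks and is purely illustrative.

```python
# Greedy streaming transducer decoding (sketch).
BLANK = 0

def greedy_rnnt_decode(frames, joint, max_symbols_per_frame=3):
    hypothesis = []
    for frame in frames:          # can run as frames stream in
        for _ in range(max_symbols_per_frame):
            label = joint(frame, hypothesis)
            if label == BLANK:
                break             # advance to the next audio frame
            hypothesis.append(label)
    return hypothesis

def toy_joint(frame, hyp):        # emits label 7 once, then only blank
    return 7 if (frame == 0 and not hyp) else BLANK

print(greedy_rnnt_decode(frames=[0, 1, 2], joint=toy_joint))  # [7]
```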
Hotword Cleaner: Dual-Microphone Adaptive Noise Cancellation with Deferred Filter Coefficients for Robust Keyword Spotting
Yiteng Huang
Turaj Zakizadeh Shabestary
ICASSP (2019)
Abstract
This paper presents a novel dual-microphone speech enhancement algorithm that improves the noise robustness of hotword (wake-word) detection as a special application of keyword spotting. It exploits two unique properties of hotwords: they are the leading phrases of valid voice queries that we intend to respond to, and they have short durations. Consequently, we propose an STFT-based adaptive noise cancellation method, modified to use deferred filter coefficients, to extract hotwords from noisy stereo microphone signals. The new algorithm is tested with two considerably different neural hotword detectors. Both systems significantly reduce the false-reject rate when the background contains strong TV noise.
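A toy time-domain sketch of the deferred-coefficient idea: an adaptive filter keeps updating, but the coefficients actually applied lag behind, so adaptation that happens during the hotword cannot cancel the hotword itself. The paper's method operates per STFT bin; `deferred_anc`, its NLMS update, and all parameters here are illustrative assumptions.

```python
import numpy as np
from collections import deque

# ANC with deferred filter coefficients (toy, time-domain NLMS version).
def deferred_anc(primary, reference, taps=8, mu=0.5, defer=100, eps=1e-8):
    w = np.zeros(taps)                        # currently adapting weights
    history = deque([w.copy()], maxlen=defer + 1)
    out = np.zeros_like(primary)
    for n in range(taps, len(primary)):
        x = reference[n - taps:n][::-1]
        out[n] = primary[n] - history[0] @ x  # filter with deferred weights
        err = primary[n] - w @ x              # NLMS update of current weights
        w = w + mu * err * x / (x @ x + eps)
        history.append(w.copy())
    return out

rng = np.random.default_rng(0)
noise = rng.standard_normal(4000)
primary = 0.5 * np.roll(noise, 2) + 0.01 * rng.standard_normal(4000)
print(np.var(primary), np.var(deferred_anc(primary, noise)[500:]))
```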
A Cascade Architecture for Keyword Spotting on Mobile Devices
Raziel Alvarez
Chris Thornton
Mohammadali Ghodrat
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA
Abstract
We present a cascade architecture for keyword spotting with speaker verification on mobile devices. By pairing a small computational footprint with specialized digital signal processing (DSP) chips, we are able to achieve low power consumption while continuously listening for a keyword.
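The cascade can be sketched as a cheap, permissive first-stage scorer (meant to run on a low-power DSP) gating an expensive, stricter second stage that runs on the main CPU only when the first stage fires. `cascade_detect`, both scorers, and the thresholds are hypothetical.

```python
# Two-stage cascade detection (sketch).
def cascade_detect(window, dsp_score, cpu_score,
                   dsp_threshold=0.4, cpu_threshold=0.8):
    if dsp_score(window) < dsp_threshold:
        return False              # most windows are rejected cheaply here
    # The second stage runs rarely, so average power stays low.
    return cpu_score(window) >= cpu_threshold

# Toy usage with constant stub scorers.
print(cascade_detect("audio", dsp_score=lambda w: 0.6,
                     cpu_score=lambda w: 0.9))  # True
```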
Personalized Speech Recognition On Mobile Devices
Raziel Alvarez
Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2016)
Abstract
We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time.
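The SVD-based compression mentioned above can be illustrated as a low-rank factorization of one weight matrix; the shapes and rank below are arbitrary examples, not the paper's settings.

```python
import numpy as np

# Replace W (m x n) with rank-r factors, shrinking parameter count
# from m*n to r*(m + n).
def svd_compress(W, rank):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # (m, rank)
    B = Vt[:rank, :]             # (rank, n)
    return A, B                  # A @ B approximates W

W = np.random.randn(640, 2048)
A, B = svd_compress(W, rank=128)
print(W.size, A.size + B.size)   # 1310720 -> 344064 parameters
```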
On-Demand Language Model Interpolation for Mobile Speech Input
Brandon Ballinger
Johan Schalkwyk
Interspeech (2010), pp. 1812-1815
Abstract
Google offers several speech features on the Android mobile operating system: search by voice, voice input to any text field, and an API for application developers. As a result, our speech recognition service must support a wide range of usage scenarios and speaking styles: relatively short search queries, addresses, business names, dictated SMS and e-mail messages, and a long tail of spoken input to any of the applications users may install. We present a method of on-demand language model interpolation in which contextual information about each utterance determines interpolation weights among a number of n-gram language models. On-demand interpolation results in an 11.2% relative reduction in WER compared to using a single language model to handle all traffic.
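A minimal sketch of on-demand interpolation, assuming per-utterance context (e.g., which text field is focused) supplies the mixture weights over component n-gram LMs. The component LMs and all numbers below are made up for illustration.

```python
import math

def interpolated_logprob(word, history, components, weights):
    """components: functions p(word | history); weights should sum to 1."""
    p = sum(w * lm(word, history) for lm, w in zip(components, weights))
    return math.log(p)

search_lm = lambda w, h: 0.02 if w == "weather" else 0.001   # toy search LM
sms_lm = lambda w, h: 0.001 if w == "weather" else 0.004     # toy SMS LM

# Context "search box" weights the search LM heavily.
print(interpolated_logprob("weather", (), [search_lm, sms_lm], [0.9, 0.1]))
```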
Other Publications

A Self-Transcribing Speech Corpus: Collecting Continuous Speech with an Online Educational Game
City Browser: Developing a Conversational Automotive HMI
Jarrod Orszulak
Sean Liu
Shannon Roberts
Jeff Zabel
Bryan Reimer
Bruce Mehler
Stephanie Seneff
James Glass
Joseph Coughlin
Proc. of CHI (2009)
A Self-Labeling Speech Corpus: Collecting Spoken Words with an Online Educational Game
A Multimodal Home Entertainment Interface via a Mobile Device
Bo-June (Paul) Hsu
James Glass
Stephanie Seneff
Lee Hetherington
Scott Cyphers
Ibrahim Badr
Chao Wang
Sean Liu
Proc. of the ACL Workshop on Mobile Language Processing (2008)
Response-Based Confidence Annotation for Spoken Dialogue Systems
Proc. of the 9th SIGdial Workshop on Discourse and Dialogue (2008)
The WAMI Toolkit for Developing, Deploying, and Evaluating Web-Accessible Multimodal Interfaces
Ibrahim Badr
Proc. of 10th International Conference on Multimodal Interfaces (2008)
Meeting Structure Annotation
John Niekrasz
Matthew Purver
Recent Trends in Discourse and Dialogue, Springer (2008)
Releasing a Multimodal Dialogue System into the Wild: User Support Mechanisms
Stephanie Seneff
Proc. of the 8th SIGdial Workshop on Discourse and Dialogue (2007)
Context Sensitive Language Modeling for Large Sets of Proper Nouns in Multimodal Dialogue Systems
Stephanie Seneff
Proc. of IEEE/ACL Workshop on Spoken Language Technology (2006)
Scalable and Portable Web-based Multimodal Dialogue Interaction with Geographical Databases
NOMOS: A Semantic Web Software Framework for Annotation of Multimodal Corpora
John Niekrasz
Proc. of the 5th Conference on Language Resources and Evaluation (LREC 2006)
Context-Sensitive Statistical Language Modeling
Meeting Structure Annotation: Data and Tools
John Niekrasz
Matthew Purver
Proc. of the 6th SIGdial Workshop on Discourse and Dialogue (2005)
A General Purpose Architecture for Intelligent Tutoring Systems
Brady Clark
Oliver Lemon
Elizabeth Owen Bratt
John Fry
Stanley Peters
Heather Pon-Barry
Karl Schultz
Zack Thomsen-Gray
Pucktada Treeratpituk
Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems, Kluwer (2005)
Emotional Information Available from Videotapes vs Transcripts
Anna Liess
Wendy Ellis
Janine Giese-Davis
Mitch Golant
David Spiegel
Proceedings of the 25th Annual Meeting of the Society of Behavioral Medicine (2004)
Demo: A Multimodal Learning Interface for Sketch, Speak and Point creation of a Schedule Chart
Ed Kaiser
David Demirdjian
Xiaoguang Li
John Niekrasz
Matt Wesson
Sanjeev Kumar
Proceedings of the Sixth International Conference on Multimodal Interfaces (ICMI 2004)
Multi-Human Dialogue Understanding for Assisting Artifact-Producing Meetings
John Niekrasz
Lawrence Cavedon
Proceedings of the 20th International Conference on Computational Linguistics (COLING) (2004)
Using an Activity Model to Address Issues in Task-Oriented Dialogue Interaction Over Extended Periods
Lawrence Cavedon
Proceedings of AAAI Spring Symposium on Interaction Between Humans and Autonomous Systems over Extended Periods (2004)
Managing uncertainty in dialogue information state for real time understanding of multi-human meeting dialogues
Lawrence Cavedon
John Niekrasz
Dominic Widdows
Stanley Peters
Proceedings of the 8th Workshop on Formal Semantics and Pragmatics of Dialogue (Catalog) (2004)
Multithreaded context for robust conversational interfaces: context-sensitive speech recognition and interpretation of corrective fragments
Oliver Lemon
ACM Transactions on Computer-Human Interaction, vol. 11(3) (2004), pp. 241-267
Targeted Help for Spoken Dialogue Systems: Intelligent Feedback Improves Naive User's Performance
Beth Ann Hockey
Oliver Lemon
Ellen Campana
Laura Hiatt
Gregory Aist
James Hieronymus
John Dowding
Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL) (2003)
Generation of collaborative spoken dialogue contributions in dynamic task environments
Oliver Lemon
Randolph Gullett
Alexis Battle
Laura Hiatt
Stanley Peters
Working Papers of the 2003 AAAI Spring Symposium on Natural Language Generation in Spoken and Written Dialogue, AAAI Press, pp. 85-90
An information state approach in a multi-modal dialogue system for human-robot conversation
Oliver Lemon
Anne Bracy
Stanley Peters
Perspectives on Dialogue in the New Millennium, John Benjamins (2003), pp. 229-242
Collaborative Activities and Multi-tasking in Dialogue Systems
Oliver Lemon
Stanley Peters
Traitement Automatique des Langues, vol. 43(2) (2002), pp. 131-154
Collaborative Dialogue for Controlling Autonomous Systems
Oliver Lemon
Lawrence Cavedon
Stanley Peters
Proceedings of the AAAI Fall Symposium (2002)
Multi-tasking and Collaborative Activities in Dialogue Systems
Oliver Lemon
Alexis Battle
Stanley Peters
Proceedings of the 3rd SIGdial Workshop on Discourse and Dialogue (2002), pp. 113-124
Information States in a Multi-modal Dialogue System for Human-Robot Conversation
Oliver Lemon
Anne Bracy
Stanley Peters
Proceedings of the 5th Workshop on Formal Semantics and Pragmatics of Dialogue (Bi-Dialog 2001), pp. 57-67
A Multi-Modal Dialogue System for Human-Robot Conversation
Oliver Lemon
Anne Bracy
Stanley Peters
Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL) (2001)
The WITAS Multi-Modal Dialogue System I
Oliver Lemon
Anne Bracy
Stanley Peters
7th European Conference on Speech Communication and Technology (Eurospeech) (2001)