Alexander Gruenstein

Alex Gruenstein works on mobile speech interfaces at Google. He holds a Ph.D. in Computer Science from MIT, as well as B.S. and M.S. degrees in Symbolic Systems from Stanford University.

Authored Publications
    On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER), and latency, measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with. Thus E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model, we explore using a Hybrid Autoregressive Transducer (HAT) factorization to better integrate an on-device neural LM trained on text-only data. Furthermore, to improve decoder latency, we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, and is 318X smaller.
    A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency
    Ruoming Pang
    Antoine Bruguier
    Wei Li
    Raziel Alvarez
    Chung-Cheng Chiu
    David Garcia
    Kevin Hu
    Minho Jin
    Qiao Liang
    Cal Peyser
    David Rybach
    (June) Yuan Shangguan
    Yash Sheth
    Mirkó Visontai
    Yu Zhang
    Ding Zhao
    ICASSP (2020)
    Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model. For example, for the same latency, RNN-T+LAS obtains an 8% relative improvement in WER, while being more than 400 times smaller in model size.
    We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system. Delivering such a model presents numerous challenges: It should improve the performance when the input signal consists of overlapped speech, and must not hurt the speech recognition performance under all other acoustic conditions. Besides, this model must be tiny, fast, and perform inference in a streaming fashion, in order to have minimal impact on CPU, memory, battery and latency. We propose novel techniques to meet these multi-faceted requirements, including using a new asymmetric loss, and adopting adaptive runtime suppression strength. We also show that such a model can be quantized as an 8-bit integer model and run in real time.
    This paper presents a novel dual-microphone speech enhancement algorithm to improve noise robustness of hotword (wake-word) detection as a special application of keyword spotting. It exploits two unique properties of hotwords: they are leading phrases of valid voice queries that we intend to respond to, and they have short durations. Consequently, an STFT-based adaptive noise cancellation method modified to use deferred filter coefficients is proposed to extract hotwords from noisy stereo microphone signals. The new algorithm is tested with two considerably different neural hotword detectors. Both systems show a significantly reduced false-reject rate when the background contains strong TV noise.
    End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.
    Multi-Microphone Adaptive Noise Cancellation for Robust Hotword Detection
    Yiteng Huang
    Turaj Zakizadeh Shabestary
    Li Wan
    Proc. InterSpeech 2019, pp. 1233-1237
    Recently we proposed a dual-microphone adaptive noise cancellation (ANC) algorithm with deferred filter coefficients for robust hotword detection in [1]. It exploits two unique hotword-related features: hotwords are the leading phrase of valid voice queries and they are short. These features allow us to avoid computing a speech-noise mask, which is a common prerequisite for many multichannel speech enhancement approaches. This novel idea was found effective against strong and ambiguous speech-like TV noise. In this paper, we show that it can be generalized to support more than two microphones. The development is validated using re-recorded data with background TV noise from a 3-mic array. By adding one more microphone, the false reject (FR) rate can be reduced by a further 33.5% relative.
    A Cascade Architecture for Keyword Spotting on Mobile Devices
    Raziel Alvarez
    Chris Thornton
    Mohammadali Ghodrat
    31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA (2017)
    We present a cascade architecture for keyword spotting with speaker verification on mobile devices. By pairing a small computational footprint with specialized digital signal processing (DSP) chips, we are able to achieve low power consumption while continuously listening for a keyword.
    We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time.