Endpoint detection using grid long short-term memory networks for streaming speech recognition

Shuo-yiin Chang; Bo Li; Tara Sainath; Gabor Simko; Carolina Parada

In Proc. Interspeech 2017

Abstract

The task of endpointing is to determine when the user has finished speaking, which is important for interactive speech applications such as voice search and Google Home. In this paper, we propose a GLDNN-based (grid long short-term memory, deep neural network) endpointer model and show that it provides significant improvements over a state-of-the-art CLDNN (convolutional, long short-term memory, deep neural network) model. Specifically, we replace the convolution layer with a grid LSTM layer that models both spectral and temporal variations through recurrent connections. Results show that the GLDNN achieves a 39% relative improvement in false alarm rate at a fixed false reject rate of 2%, and reduces median latency by 11%. We also include detailed experiments investigating why grid LSTMs offer better performance than CLDNNs. Analysis reveals that the recurrent connection along the frequency axis is an important factor contributing to the performance of grid LSTMs, especially in the presence of background noise. Finally, we show that multichannel input further increases robustness to background speech. Overall, we achieve a 16% (100 ms) endpointer latency improvement relative to our previous best model.
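
The headline metric in the abstract, false alarm rate at a fixed false reject rate, can be illustrated with a minimal sketch. The abstract does not define how false alarms and false rejects are counted, so the code assumes a generic binary detector that fires when a score is at or above a threshold; the function names, the score lists, and the decision rule are all hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline, improved):
    """Fractional reduction of a lower-is-better metric (0.39 means 39%)."""
    return (baseline - improved) / baseline


def false_alarm_at_fixed_false_reject(pos_scores, neg_scores, target_fr):
    """Pick the highest threshold whose false reject rate stays within
    target_fr, then report the false alarm rate at that threshold.

    Assumption (not from the paper): the detector fires when score >= threshold,
    pos_scores are scores of true events, neg_scores are scores of non-events.
    """
    ranked = sorted(pos_scores)
    # Largest number of true events we may miss without exceeding the budget.
    max_rejected = min(int(target_fr * len(ranked)), len(ranked) - 1)
    threshold = ranked[max_rejected]
    # False rejects: true events scored below the threshold.
    fr = sum(s < threshold for s in pos_scores) / len(pos_scores)
    # False alarms: non-events scored at or above the threshold.
    fa = sum(s >= threshold for s in neg_scores) / len(neg_scores)
    return threshold, fa, fr
```

Two models would be compared by calling `false_alarm_at_fixed_false_reject` on each model's scores with `target_fr=0.02` and passing the two resulting false alarm rates to `relative_improvement`. Fixing the false reject rate first means the comparison is made at the same operating point for both models.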

Research Areas

Machine intelligence
