Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems

Shaan Bijwadia; Shuo-yiin Chang; Tara N Sainath; Bo Li; Chao Zhang; Yanzhang He

IEEE Spoken Language Technology Workshop (2022)

Abstract

Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to identify speech boundaries. This EP model strongly affects latency, but it is subject to computational constraints that limit its prediction accuracy. We propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) multitask model, improving EP quality by optionally leveraging information from the ASR audio encoder. We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This allows flexibility during inference to produce either a low-cost prediction or, if ASR computation is ongoing, a higher-quality prediction. We present results on a voice search test set showing that, compared to separate single-task models, this approach reduces median endpoint latency by 130 ms (33.3% reduction) and 90th percentile latency by 160 ms (22.2% reduction), without regressing word error rate (WER). For continuous recognition, WER improves by 10.6% (relative).
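To make the "switch" connection concrete, the following is a minimal Python sketch of the idea, not the authors' implementation. Every name in it (`asr_low_level_encoder`, `endpointer_score`, `switch_input`, `training_input`, `infer_endpoint`) is hypothetical, the two "networks" are toy arithmetic stand-ins, and the choice to pick the input path at random during training with probability `p_latent` is an assumption; the abstract does not say how the switch is trained. The sketch also assumes audio frames and latents share the same dimension.

```python
import random
from typing import List, Optional

Vector = List[float]


def asr_low_level_encoder(frame: Vector) -> Vector:
    """Toy stand-in for the low-level ASR encoder layers (hypothetical)."""
    return [0.5 * x for x in frame]


def endpointer_score(features: Vector) -> float:
    """Toy stand-in for the EP network: mean of the features, clamped to [0, 1]."""
    mean = sum(features) / len(features)
    return max(0.0, min(1.0, mean))


def switch_input(frame: Vector, latent: Optional[Vector]) -> Vector:
    """The switch: feed the EP ASR latents when available, else the raw frame."""
    return latent if latent is not None else frame


def training_input(frame: Vector, rng: random.Random, p_latent: float = 0.5) -> Vector:
    """Assumed training scheme: choose a path at random so the EP sees both inputs."""
    if rng.random() < p_latent:
        return asr_low_level_encoder(frame)
    return frame


def infer_endpoint(frame: Vector, asr_running: bool) -> float:
    """Low-cost path when ASR is idle; latent path when ASR is already computing."""
    latent = asr_low_level_encoder(frame) if asr_running else None
    return endpointer_score(switch_input(frame, latent))
```

For example, `infer_endpoint(frame, asr_running=False)` never calls the encoder, which corresponds to the low-cost prediction described above, while `asr_running=True` reuses encoder output that the ASR task would be computing anyway.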
