A Deliberation-based Joint Acoustic and Text Decoder

Sepand Mavandadi; Tara N Sainath; Kevin Hu; Zelin Wu

Proc. Interspeech 2021

Abstract

We propose a new two-pass end-to-end (E2E) speech recognition model that improves automatic speech recognition (ASR) performance by training on a combination of paired data and unpaired text data. Previously, the joint acoustic and text decoder (JATD) showed promising results by using text data during model training, and the recently introduced deliberation architecture reduced recognition errors by leveraging first-pass decoding results. Our method, dubbed Deliberation-JATD, combines the spelling-correction abilities of deliberation with JATD’s use of unpaired text data to further improve performance. The proposed model produces substantial gains across multiple test sets, especially those focused on rare words, where it reduces word error rate (WER) by 12% to 22.5% relative. These gains come without increasing model size or requiring multi-stage training, making Deliberation-JATD an efficient candidate for on-device applications.
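The relative WER reduction quoted in the abstract can be illustrated with a minimal sketch. It assumes the standard definition of WER (word-level edit distance divided by the number of reference words) and defines "relative" reduction as the drop in WER divided by the baseline WER. The function names and all numbers in the usage lines are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which WER drops relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, for illustration only (not results from the paper).
example_wer = word_error_rate("the cat sat on mat", "the cat sit on mat")
example_gain = relative_wer_reduction(baseline_wer=10.0, new_wer=8.8)
```

Under this definition, a hypothetical drop from 10.0% to 8.8% WER is a 12% relative reduction, even though the absolute change is only 1.2 points.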
