An Attention-Based Joint Acoustic and Text on-Device End-To-End Model

Tara N Sainath; Ruoming Pang; Ron J. Weiss; Yanzhang He; Chung-Cheng Chiu; Trevor Strohman

ICASSP (2020)

Abstract

Recently, we introduced a 2-pass on-device E2E model, which runs RNN-T in the first pass and then rescores or redecodes the first-pass result with a LAS decoder. This on-device model performed similarly to a state-of-the-art conventional model. However, like many E2E models, it is trained on supervised audio-text pairs and thus did poorly on rare words compared to a conventional model trained on a much larger text corpus. In this work, we introduce a joint acoustic and text-only decoder (JATD) into the LAS decoder, which allows the LAS decoder to be trained on a much larger text corpus. We find that the JATD model provides a 3-10% relative improvement in WER over a LAS decoder trained only on supervised audio-text pairs, across a variety of proper noun test sets.
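
The second-pass rescoring step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how first-pass and second-pass scores are combined, so the linear interpolation, the `weight` parameter, the function names, and the toy scores below are all assumptions made for illustration. A real system would obtain the first-pass scores from the RNN-T and the second-pass scores from the LAS decoder; here a lookup table stands in for both.

```python
from typing import Callable, List, Tuple

# A hypothesis is (transcript, first-pass log-score). Hypothetical format.
Hypothesis = Tuple[str, float]


def rescore_nbest(
    nbest: List[Hypothesis],
    second_pass_score: Callable[[str], float],
    weight: float = 0.5,
) -> List[Hypothesis]:
    """Re-rank first-pass hypotheses with a second-pass scorer.

    The combined score is a linear interpolation of the two log-scores.
    This combination rule is an assumption, not taken from the paper.
    """
    rescored = [
        (text, (1.0 - weight) * first + weight * second_pass_score(text))
        for text, first in nbest
    ]
    # Highest combined score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


def best_hypothesis(
    nbest: List[Hypothesis],
    second_pass_score: Callable[[str], float],
    weight: float = 0.5,
) -> str:
    """Return the transcript with the highest combined score."""
    return rescore_nbest(nbest, second_pass_score, weight)[0][0]


# Toy data: made-up scores standing in for real model outputs.
NBEST: List[Hypothesis] = [
    ("call jon smith", -1.0),
    ("call john smyth", -1.2),
]

# Stand-in for a second-pass decoder that has seen the rare spelling.
TOY_SECOND_PASS = {
    "call jon smith": -3.0,
    "call john smyth": -0.5,
}


def toy_second_pass_score(text: str) -> float:
    """Look up a made-up second-pass log-score; unseen text gets a low score."""
    return TOY_SECOND_PASS.get(text, -10.0)
```

With these toy numbers, `best_hypothesis(NBEST, toy_second_pass_score)` selects "call john smyth", while setting `weight=0.0` ignores the second pass and keeps the first-pass winner, "call jon smith". The example only shows why a second-pass scorer with better coverage of rare words can change the final transcript.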
