Google Research

An Attention-Based Joint Acoustic and Text On-Device End-to-End Model

Abstract

Recently, we introduced a 2-pass on-device E2E model, which runs RNN-T in the first pass and then rescores or redecodes the first-pass output with a LAS decoder. This on-device model performed comparably to a state-of-the-art conventional model. However, like many E2E models, it is trained on supervised audio-text pairs and thus performs poorly on rare words compared to a conventional model trained on a much larger text corpus. In this work, we introduce a joint acoustic and text-only decoder (JATD) into the LAS decoder, which allows the LAS decoder to be trained on a much larger text corpus. We find that the JATD model provides a 3-10% relative improvement in WER over a LAS decoder trained only on supervised audio-text pairs, across a variety of proper noun test sets.
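
To make the "relative improvement in WER" figure concrete, here is a minimal sketch of how such a number is typically computed. The function name and the WER values are illustrative assumptions, not taken from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent) for a baseline decoder and an improved one.
# These values are made up for illustration and are not results from the paper.
baseline_wer = 20.0
improved_wer = 18.5
print(f"{relative_wer_improvement(baseline_wer, improved_wer):.1f}% relative improvement")
```

Note that a relative improvement is measured against the baseline WER, so the same absolute drop counts for more when the baseline WER is already low.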
