RNN-Transducer with stateless prediction network

Eugene Weinstein; James Apfel; Mohammadreza Ghodsi; Rodrigo Cabrera; Xiaofeng Liu

ICASSP 2020, IEEE, pp. 7049-7053

Abstract

The RNN-Transducer (RNNT) outperforms classic Automatic Speech Recognition (ASR) systems when a large amount of supervised training data is available.
For low-resource languages, however, RNNT models overfit and cannot directly take advantage of additional large text corpora, as classic ASR systems can.

We focus on the prediction network of the RNNT, since it is believed to be analogous to the Language Model (LM) in classic ASR systems.
We pre-train the prediction network with text-only data, but this does not help.
Moreover, removing the recurrent layers from the prediction network, which makes it stateless, yields a model that performs virtually as well as the original RNNT model when using wordpieces.
The stateless prediction network depends only on the last output symbol, not on any earlier ones.
It therefore simplifies both the RNNT architecture and inference.
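
To make the idea of a stateless prediction network concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the class name, the embedding-table design, the sizes, and the use of index 0 as a start symbol are all hypothetical. It shows only the property described above: the output is a function of the last label alone, so two different histories ending in the same label give the same output.

```python
import random


class StatelessPredictionNetwork:
    """Hypothetical sketch: output depends only on the last emitted label."""

    def __init__(self, vocab_size, embed_dim, seed=0):
        rng = random.Random(seed)
        # One vector per wordpiece id. Index 0 doubles as the start symbol,
        # which is an assumption of this sketch, not something from the paper.
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
            for _ in range(vocab_size)
        ]

    def __call__(self, history):
        # No recurrent state: every label before the last one is ignored.
        last = history[-1] if history else 0
        return self.table[last]


pred = StatelessPredictionNetwork(vocab_size=8, embed_dim=4)

# Two different label histories that end in the same wordpiece id (2).
out_a = pred([3, 5, 2])
out_b = pred([7, 1, 2])
same_output = out_a == out_b
```

Because no hidden state is carried between steps in this sketch, a decoder would only need to remember the last emitted label for each hypothesis, which is one way to read the claim that inference becomes simpler.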

Our results suggest that the RNNT prediction network does not function as the LM in classical ASR.
Instead, it merely helps the model align to the input audio, while the RNNT encoder and joint networks capture both the acoustic and the linguistic information.
