Transducer-Based Streaming Deliberation For A Cascaded Encoder Model
Abstract
Previous research on deliberation networks has achieved excellent recognition quality. Attention-decoder-based deliberation models often work as rescorers to improve first-pass recognition results, and they often require the full first-pass hypothesis for second-pass deliberation. In this work, we propose a streaming transducer-based deliberation model. The joint network of a transducer decoder typically takes inputs from the encoder and the prediction network. We propose to use attention over the first-pass text hypotheses as a third input to the joint network. The proposed transducer-based deliberation model streams naturally, making it more desirable for on-device applications. We also show that the model improves rare word recognition, with relative WER reductions ranging from 3.6% to 10.4% across a variety of test sets. Our model does not use any additional text data for training.
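To make the three-input joint network concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. Several choices are assumptions made only for illustration: the attention query is derived from the prediction network output, attention is scaled dot-product with the hypothesis embeddings serving as both keys and values, the three projected inputs are combined by summation followed by tanh, and all names (`deliberation_joint`, `W_query`, `W_ctx`, and so on) and dimensions are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def attend(query, keys):
    """Scaled dot-product attention; keys double as values (assumption)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    context = [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]
    return context, weights


def rand_matrix(rng, rows, cols):
    """Small random weights, standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def make_params(rng, d_enc, d_pred, d_hyp, d_joint, vocab):
    return {
        "W_enc": rand_matrix(rng, d_joint, d_enc),     # encoder projection
        "W_pred": rand_matrix(rng, d_joint, d_pred),   # prediction-network projection
        "W_ctx": rand_matrix(rng, d_joint, d_hyp),     # third input: attention context
        "W_query": rand_matrix(rng, d_hyp, d_pred),    # query projection (assumption)
        "W_out": rand_matrix(rng, vocab, d_joint),     # output layer over the vocabulary
    }


def deliberation_joint(params, enc_t, pred_u, hyp_embeddings):
    """Joint network with three inputs for one (frame, label) position.

    enc_t:          encoder output for the current frame
    pred_u:         prediction network output for the current label history
    hyp_embeddings: embeddings of the first-pass hypothesis tokens seen so far
    """
    query = matvec(params["W_query"], pred_u)
    context, weights = attend(query, hyp_embeddings)
    summed = [
        a + b + c
        for a, b, c in zip(
            matvec(params["W_enc"], enc_t),
            matvec(params["W_pred"], pred_u),
            matvec(params["W_ctx"], context),
        )
    ]
    hidden = [math.tanh(v) for v in summed]
    probs = softmax(matvec(params["W_out"], hidden))
    return probs, weights


# Toy usage with random inputs; all sizes are arbitrary.
rng = random.Random(0)
params = make_params(rng, d_enc=4, d_pred=3, d_hyp=5, d_joint=6, vocab=8)
enc_t = [rng.uniform(-1, 1) for _ in range(4)]
pred_u = [rng.uniform(-1, 1) for _ in range(3)]
hyp_embeddings = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(7)]
probs, weights = deliberation_joint(params, enc_t, pred_u, hyp_embeddings)
```

Because the attention here only looks at whatever first-pass tokens are passed in, the same function can be called with a partial hypothesis as it grows, which is one way to picture how a model of this kind could operate without waiting for the full first-pass result.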