Global Normalization for Streaming Speech Recognition in a Modular Framework

David Rybach
Ke Wu
Michael Riley
arXiv (2022)

Abstract

We introduce the Globally Normalized Autoregressive Transducer (GNAT) for addressing the label bias problem in streaming speech recognition. Our solution admits a tractable exact computation of the denominator for the sequence-level normalization. Through theoretical and empirical results, we demonstrate that by switching to a globally normalized model, the word error rate gap between streaming and non-streaming speech-recognition models can be greatly reduced (by more than 50% on the Librispeech dataset). This model is developed in a modular framework which encompasses all the common neural speech recognition models. The modularity of this framework enables controlled comparison of modelling choices and creation of new models.
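
As a toy illustration of what an exact sequence-level denominator involves, the sketch below normalizes over all label sequences of a fixed length at once, rather than normalizing each step separately. It is not the GNAT model: the assumption that each score depends only on the previous label, the fixed start state, and all function names are hypothetical choices made here so that a dynamic program can compute the denominator exactly and be checked against brute-force enumeration.

```python
import itertools
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(scores, seq):
    """Unnormalized log-score of one label sequence.

    scores[t][prev][cur] is the score of emitting `cur` at step t after
    `prev`. Assumption for this sketch: step 0 uses a fixed start state 0.
    """
    prev, total = 0, 0.0
    for t, cur in enumerate(seq):
        total += scores[t][prev][cur]
        prev = cur
    return total


def log_denominator_dp(scores, num_labels):
    """Exact log-denominator over all sequences, in O(T * num_labels^2)."""
    alpha = [scores[0][0][y] for y in range(num_labels)]
    for t in range(1, len(scores)):
        alpha = [
            logsumexp([alpha[p] + scores[t][p][y] for p in range(num_labels)])
            for y in range(num_labels)
        ]
    return logsumexp(alpha)


def log_denominator_brute(scores, num_labels):
    """Same quantity by enumerating every sequence (exponential cost)."""
    all_seqs = itertools.product(range(num_labels), repeat=len(scores))
    return logsumexp([path_score(scores, seq) for seq in all_seqs])


def global_log_prob(scores, seq, num_labels):
    """Globally normalized log-probability of one sequence."""
    return path_score(scores, seq) - log_denominator_dp(scores, num_labels)


# Deterministic toy scores: 4 steps, 3 labels (arbitrary values).
NUM_LABELS, NUM_STEPS = 3, 4
toy_scores = [
    [[math.sin(t + 2 * p + 3 * y) for y in range(NUM_LABELS)]
     for p in range(NUM_LABELS)]
    for t in range(NUM_STEPS)
]
```

Under these assumptions the two denominators agree, and the globally normalized probabilities of all sequences sum to one; only the dynamic program remains practical as the sequence length grows.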