Tongzhou Chen
Authored Publications
Multilingual Second-Pass Rescoring for Automatic Speech Recognition Systems
Pedro Moreno Mengibar
ICASSP (2022)
Second-pass rescoring is a well-known technique for improving the performance of Automatic Speech Recognition (ASR) systems. Neural oracle search (NOS), which selects the most likely hypothesis from an N-best list by integrating information from multiple sources, such as the input acoustic representations, the N-best hypotheses, additional first-pass statistics, and unpaired textual information through an external language model, has shown success in rescoring for RNN-T first-pass models. Multilingual first-pass speech recognition models often outperform their monolingual counterparts when trained on related or low-resource languages. In this paper, we investigate making the second-pass model multilingual and apply rescoring on a multilingual first-pass. We conduct experiments on Nordic languages including Danish, Dutch, Finnish, Norwegian and Swedish.
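The score combination behind second-pass rescoring can be pictured with a simple log-linear baseline. The sketch below is a minimal illustration only, not the paper's NOS architecture: the Hypothesis fields, the lm_weight value, and the rescore_nbest function are assumed names for this example.

```python
# Minimal sketch of second-pass N-best rescoring via log-linear score
# interpolation. All names and the weight value are illustrative
# assumptions, not the NOS model described in the paper.
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    first_pass_logprob: float  # log P(y|x) from the first-pass ASR model
    lm_logprob: float          # log P(y) from an external language model


def rescore_nbest(nbest: list[Hypothesis], lm_weight: float = 0.3) -> Hypothesis:
    """Pick the hypothesis maximizing an interpolated score:

    score(y) = first_pass_logprob(y) + lm_weight * lm_logprob(y)
    """
    return max(
        nbest,
        key=lambda h: h.first_pass_logprob + lm_weight * h.lm_logprob,
    )


if __name__ == "__main__":
    nbest = [
        Hypothesis("recognize speech", -4.1, -8.2),
        Hypothesis("wreck a nice beach", -3.9, -15.7),
    ]
    print(rescore_nbest(nbest).text)  # the LM term demotes the implausible hypothesis
```

A multilingual second-pass, as studied in the paper, would share one such rescoring model across languages rather than training a separate one per language.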
On Weight Interpolation of the Hybrid Autoregressive Transducer Model
David Rybach
Interspeech 2022 (to appear)
This paper explores ways to improve a two-pass speech recognition system in which the first pass is a hybrid autoregressive transducer (HAT) model and the second pass is a neural language model. The main focus is on the scores provided by each of these models: their quantitative analysis, how to improve them, and how best to integrate them for better recognition accuracy. Several analyses are presented to show the importance of the choice of integration weights for combining the first-pass and second-pass scores. A sequence-level weight estimation model, along with four training criteria, is proposed to allow adaptive integration of the scores per acoustic sequence. The effectiveness of this approach is demonstrated by constructing and analyzing models on the LibriSpeech data set.
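The idea of a per-sequence integration weight can be sketched as follows. This is a minimal illustration under assumed names: predict_weight, its linear-plus-sigmoid form, and the utterance-level features are placeholders, not the paper's weight estimation model or its four training criteria.

```python
# Hedged sketch of per-sequence score integration: a scalar weight w is
# predicted from utterance-level features and used to interpolate the
# first-pass (HAT) score with the second-pass LM score for a hypothesis.
# The feature choice and the linear estimator are illustrative only.
import math


def predict_weight(features: list[float], params: list[float], bias: float) -> float:
    """Map utterance-level features to an integration weight in (0, 1)."""
    z = sum(f * p for f, p in zip(features, params)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid keeps the weight in (0, 1)


def integrated_score(first_pass: float, second_pass: float, w: float) -> float:
    """Convex combination of the two log-scores for one hypothesis."""
    return (1.0 - w) * first_pass + w * second_pass
```

The point of making w a function of the acoustic sequence, rather than a single tuned constant, is that the best balance between the two models can differ from utterance to utterance.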
Mixture Model Attention: Flexible Streaming and Non-Streaming Automatic Speech Recognition
Pedro Jose Moreno Mengibar
Proceedings of Interspeech, 2021 (to appear)
Streaming automatic speech recognition (ASR) hypothesizes words as soon as the input audio arrives, whereas non-streaming ASR can potentially wait for the completion of the entire utterance to hypothesize words.
Streaming and non-streaming ASR systems have typically used different acoustic encoders.
Recent work has attempted to unify them by either jointly training a fixed stack of streaming and non-streaming layers or using knowledge distillation during training to ensure consistency between the streaming and non-streaming predictions.
We propose mixture model (MiMo) attention as a simpler and theoretically motivated alternative that replaces only the attention mechanism, requires no change to the training loss, and allows greater flexibility in switching between streaming and non-streaming modes during inference.
Our experiments on the public LibriSpeech data set and a few Indic language data sets show that MiMo attention endows a single ASR model with the ability to operate in both streaming and non-streaming modes without any overhead and without significant loss in accuracy compared to separately trained streaming and non-streaming models.
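One plausible reading of mixture model attention is a mixture of Gaussians over encoder frame positions, where streaming mode simply restricts normalization to the frames seen so far. The sketch below is an assumed instantiation, not the paper's exact parameterization; mimo_attention and its arguments are illustrative names.

```python
# Assumed sketch of mixture-model attention: attention weights come from a
# mixture of Gaussians over encoder positions. In streaming mode the same
# mixture parameters are reused, but probability mass is renormalized over
# only the frames visible so far. All arguments are NumPy arrays.
import numpy as np


def mimo_attention(values, means, stds, mix_weights, limit=None):
    """Attend over encoder frames with a mixture-of-Gaussians distribution.

    values:      (T, D) encoder outputs
    means, stds: (K,) per-component position parameters
    mix_weights: (K,) non-negative mixture weights summing to 1
    limit:       in streaming mode, only frames [0, limit) are visible
    """
    T = values.shape[0] if limit is None else min(limit, values.shape[0])
    pos = np.arange(T)[None, :]  # (1, T) frame indices
    # Gaussian density of each component at each visible frame: (K, T).
    dens = np.exp(-0.5 * ((pos - means[:, None]) / stds[:, None]) ** 2)
    dens /= stds[:, None] * np.sqrt(2.0 * np.pi)
    probs = (mix_weights[:, None] * dens).sum(axis=0)
    probs /= probs.sum()  # renormalize over the visible frames only
    return probs @ values[:T]  # (D,) context vector
```

Because the same mixture parameters serve both modes, a single trained model can switch between streaming and non-streaming inference without retraining, which matches the flexibility the abstract describes.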