Neeraj Gaur
Authored Publications
Abstract
This paper discusses a method to inject text when training an ASR system without the need for upsampling the text sequence to match the length of the speech sequence.
Multilingual Second-Pass Rescoring for Automatic Speech Recognition Systems
Pedro Moreno Mengibar
ICASSP (2022)
Abstract
Second-pass rescoring is a well-known technique for improving the performance of Automatic Speech Recognition (ASR) systems. Neural oracle search (NOS), which selects the most likely hypothesis from the N-best list by integrating information from multiple sources, such as the input acoustic representations, the N-best hypotheses, additional first-pass statistics, and unpaired textual information through an external language model, has shown success in rescoring for RNN-T first-pass models. Multilingual first-pass speech recognition models often outperform their monolingual counterparts when trained on related or low-resource languages. In this paper, we investigate making the second-pass model multilingual and apply rescoring on top of a multilingual first pass. We conduct experiments on Nordic languages including Danish, Dutch, Finnish, Norwegian and Swedish.
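To make the rescoring step concrete, the sketch below shows a generic log-linear N-best rescorer that combines first-pass scores with an external language model. It illustrates second-pass rescoring in general, not the NOS model itself, and the names (Hypothesis, rescore_nbest, lm_weight) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Hypothesis:
    text: str
    first_pass_logprob: float  # log-probability assigned by the first-pass model

def rescore_nbest(
    nbest: List[Hypothesis],
    lm_logprob: Callable[[str], float],  # external LM trained on unpaired text
    lm_weight: float = 0.3,
) -> Hypothesis:
    """Return the hypothesis with the best combined first-pass + LM score."""
    def combined(h: Hypothesis) -> float:
        return h.first_pass_logprob + lm_weight * lm_logprob(h.text)
    return max(nbest, key=combined)
```

NOS replaces this hand-tuned combination with a learned scorer over the same signals (acoustic representations, the N-best hypotheses, first-pass statistics and the external LM).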
Mixture of Informed Experts for Multilingual Speech Recognition
Brian Farris
Pedro Jose Moreno Mengibar
Yun Zhu
ICASSP 2021, IEEE International Conference on Acoustics, Speech and Signal Processing (to appear)
Abstract
Multilingual speech recognition models are capable of recognizing speech in multiple different languages. When trained on related or low-resource languages, these models often outperform their monolingual counterparts. Similar to other forms of multi-task models, however, when the group of languages is unrelated, or when large amounts of training data are available, multilingual models can suffer from performance loss. We investigate the use of a mixture-of-experts approach to assign per-language parameters in the model, increasing network capacity in a structured fashion. We introduce a novel variant of this approach, 'informed experts', which tackles inter-task conflicts by eliminating gradients from other tasks in these task-specific parameters. We conduct experiments on a real-world task covering English, French and four dialects of Arabic to show the effectiveness of our approach.
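A minimal PyTorch sketch of per-language expert parameters is given below: because each utterance is routed only through its own language's expert, the other experts receive no gradient from that utterance, which is the kind of gradient isolation the abstract describes. The module and parameter names are assumptions for illustration; the paper's 'informed experts' formulation differs in its details.

```python
import torch
import torch.nn as nn

class PerLanguageExperts(nn.Module):
    """One feed-forward expert per language, selected by language ID.

    An utterance only passes through its own language's expert, so the
    remaining experts receive zero gradient for that utterance.
    """

    def __init__(self, dim: int, num_languages: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_languages)]
        )

    def forward(self, x: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # x: [batch, time, dim], lang_id: [batch] of integer language indices
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = lang_id == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out
```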
Self-Adaptive Distillation for Multilingual Speech Recognition: Leveraging Student Independence
Brian Farris
Pedro Jose Moreno Mengibar
Yun Zhu
Interspeech 2021 (to appear)
Abstract
With a large share of the world's population speaking more than one language, multilingual automatic speech recognition (ASR) has gained popularity in recent years. While lower-resource languages can benefit from quality improvements in a multilingual ASR system, including unrelated or higher-resource languages in the mix often results in performance degradation. In this paper, we propose distilling from multiple teachers, with each language using its best teacher during training, to tackle this problem. We introduce self-adaptive distillation, a novel technique for automatic weighting of the distillation loss that uses the student's and teachers' confidences. We analyze the effectiveness of the proposed techniques on two real-world use cases and show that the performance of multilingual ASR models can be improved by up to 11.5% without any increase in model capacity. Furthermore, we show that when our methods are combined with an increase in model capacity, we can achieve quality gains of up to 20.7%.
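The sketch below illustrates one plausible way to weight a distillation term by student and teacher confidences, assuming confidence is taken as the maximum posterior probability; the exact weighting scheme used in the paper is not reproduced here, and the function name and temperature are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def self_adaptive_distillation_loss(
    student_logits: torch.Tensor,  # [batch, num_classes]
    teacher_logits: torch.Tensor,  # [batch, num_classes], from that language's best teacher
    targets: torch.Tensor,         # [batch] integer labels
    temperature: float = 1.0,
) -> torch.Tensor:
    """Cross-entropy plus a distillation term whose weight adapts per example."""
    ce = F.cross_entropy(student_logits, targets, reduction="none")

    student_probs = F.softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # Illustrative confidence-based weight: distill harder when the teacher is
    # confident and the student is not (this exact form is an assumption).
    teacher_conf = teacher_probs.max(dim=-1).values
    student_conf = student_probs.max(dim=-1).values
    weight = (teacher_conf * (1.0 - student_conf)).detach()

    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs,
        reduction="none",
    ).sum(dim=-1)

    return (ce + weight * kl).mean()
```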
Multilingual Speech Recognition with Self-Attention Structured Parameterization
Yun Zhu
Brian Farris
Hainan Xu
Han Lu
Pedro Jose Moreno Mengibar
Qian Zhang
Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, ISCA
Abstract
Multilingual automatic speech recognition systems can transcribe utterances from different languages. These systems are attractive from several perspectives: they can provide quality improvements, especially for lower-resource languages, and they simplify training and deployment. End-to-end speech recognition has further simplified multilingual modeling, since a single model, rather than the several components of a classical system, has to be built and maintained. In this paper, we investigate a streamable end-to-end multilingual system based on the Transformer Transducer. We propose several techniques for adapting the self-attention architecture based on the language ID. We analyze the trade-offs of each method with regard to quality gains and the number of additional parameters introduced. We conduct experiments on a real-world task consisting of five languages. Our experimental results demonstrate ~10% and ~15% relative gains over the baseline multilingual model.
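As an illustration of adapting self-attention on the language ID, the sketch below adds a small language-specific bias to the query input of an otherwise shared attention layer. The paper compares several such parameterizations; this particular variant, along with the module names, is an assumption made for the example.

```python
import torch
import torch.nn as nn

class LanguageAdaptedSelfAttention(nn.Module):
    """Shared multi-head self-attention with a small per-language adaptation.

    Here the adaptation is a language-specific bias added to the query input;
    each language contributes only `dim` extra parameters per layer, which is
    the kind of quality-versus-parameter trade-off the abstract refers to.
    """

    def __init__(self, dim: int, num_heads: int, num_languages: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lang_query_bias = nn.Embedding(num_languages, dim)

    def forward(self, x: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # x: [batch, time, dim], lang_id: [batch]
        q = x + self.lang_query_bias(lang_id).unsqueeze(1)  # broadcast over time
        out, _ = self.attn(q, x, x, need_weights=False)
        return out
```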
Leveraging Language ID in Multilingual End-to-End Speech Recognition
Delia Qu
Pedro Jose Moreno Mengibar
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019
Abstract
Multilingual speech recognition models are capable of recognizing speech in multiple different languages. Depending on the amount of training data and the relatedness of the languages, these models can outperform their monolingual counterparts. However, the performance of these models relies heavily on an externally provided language ID, which is used to augment the input features or to modulate the neural network's per-layer outputs through a language gate. In this paper, we introduce a novel technique for inferring the language ID in a streaming fashion using the RNN-T loss, eliminating the reliance on knowing the utterance's language. We conduct experiments on two sets of languages, Arabic and Nordic, and show the effectiveness of our approach.
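One common way to obtain a streaming language-ID prediction from an RNN-T model is to add language tags to the output vocabulary so that the model emits a language token before the transcript and is trained on such augmented targets with the usual RNN-T loss. The sketch below shows only that generic idea as an assumption for illustration; the paper's actual mechanism is not detailed in the abstract, and the tag names and helper are hypothetical.

```python
from typing import Dict, List

# Hypothetical language tags added to the RNN-T output vocabulary so the model
# can emit a language token as its first output (illustrative assumption).
LANG_TAGS: Dict[str, str] = {"ar-EG": "<ar-EG>", "da-DK": "<da-DK>", "sv-SE": "<sv-SE>"}

def build_targets(transcript_tokens: List[int],
                  lang: str,
                  tag_to_id: Dict[str, int]) -> List[int]:
    """Prepend the utterance's language tag to the label sequence.

    At inference time no language ID is supplied: the first emitted token
    serves as the model's streaming language-ID estimate.
    """
    return [tag_to_id[LANG_TAGS[lang]]] + transcript_tokens
```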
From audio to semantics: Approaches to end-to-end spoken language understanding
Galen Chuang
Pedro Jose Moreno Mengibar
Delia Qu
Spoken Language Technology Workshop (SLT), 2018 IEEE
Abstract
Conventional spoken language understanding systems consist of two main components: an automatic speech recognition module that converts audio to text, and a natural language understanding module that transforms the resulting text (or top-N hypotheses) into a set of intents and arguments. These modules are typically optimized independently. In this paper, we formulate audio-to-semantic understanding as a sequence-to-sequence problem. We propose and compare various encoder-decoder based approaches that optimize both modules jointly, in an end-to-end manner. We evaluate these methods on a real-world task. Our results show that keeping an intermediate text representation while jointly optimizing the full system improves prediction accuracy.
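A drastically simplified sketch of the joint-optimization idea with an intermediate text representation is shown below: a shared audio encoder feeds both a transcript head and a semantics head, and their losses are combined. The single-layer GRU, frame-level heads and loss weighting are assumptions for illustration, not the encoder-decoder architectures compared in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSLU(nn.Module):
    """Shared audio encoder with transcript and semantics heads (toy sketch)."""

    def __init__(self, feat_dim: int, hidden: int, text_vocab: int, sem_vocab: int):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.text_head = nn.Linear(hidden, text_vocab)  # intermediate text representation
        self.sem_head = nn.Linear(hidden, sem_vocab)    # intents / argument tags

    def forward(self, audio_feats: torch.Tensor):
        # audio_feats: [batch, time, feat_dim]
        enc, _ = self.encoder(audio_feats)  # [batch, time, hidden]
        return self.text_head(enc), self.sem_head(enc)

def joint_loss(text_logits, sem_logits, text_targets, sem_targets, alpha: float = 0.5):
    """Weighted sum of transcript and semantics losses, optimized end to end."""
    # Logits are [batch, time, vocab]; targets are toy per-frame labels [batch, time].
    text_loss = F.cross_entropy(text_logits.transpose(1, 2), text_targets)
    sem_loss = F.cross_entropy(sem_logits.transpose(1, 2), sem_targets)
    return alpha * text_loss + (1 - alpha) * sem_loss
```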