Anshuman Tripathi
Authored Publications
End-to-end audio-visual speech recognition for overlapping speech
INTERSPEECH 2021: Conference of the International Speech Communication Association
This paper investigates an end-to-end modeling approach for ASR that explicitly deals with scenarios where there are overlapping speech utterances from multiple talkers.
The approach assumes the availability of both audio signals and video signals in the form of continuous mouth-tracks aligned with speech for overlapping speakers.
This work extends previous work on audio-only multi-talker ASR applied to two-party conversations in a call center application. It also extends work on end-to-end audio-visual (A/V) ASR applied to A/V YouTube (YT) Confidence Island utterances. It is shown that incorporating an attention-weighted combination of visual features in A/V multi-talker RNNT models significantly improves speaker disambiguation in ASR on overlapping speech. A 17% reduction in WER was observed for A/V multi-talker models relative to audio-only multi-talker models on a simulated A/V overlapped-speech corpus.
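As a rough illustration of the attention-weighted combination of visual features described above, the NumPy sketch below lets each audio frame attend over per-speaker mouth-track features and concatenates the resulting visual summary with the acoustic features. The shapes, projection matrices, and function names are illustrative assumptions, not the paper's RNNT architecture.

```python
# Minimal sketch (NumPy) of attention-weighted fusion of per-speaker visual
# features with audio features. Shapes and names are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_audio_visual(audio, visual, W_q, W_k):
    """audio: (T, Da) acoustic frames; visual: (S, T, Dv) mouth-track
    features for S candidate speakers, time-aligned with the audio."""
    q = audio @ W_q                                  # (T, Dk) queries from audio
    k = np.einsum('std,dk->stk', visual, W_k)        # (S, T, Dk) keys per speaker
    scores = np.einsum('tk,stk->ts', q, k) / np.sqrt(q.shape[-1])  # (T, S)
    alpha = softmax(scores, axis=-1)                 # attention over speakers per frame
    v_mix = np.einsum('ts,std->td', alpha, visual)   # (T, Dv) weighted visual summary
    return np.concatenate([audio, v_mix], axis=-1)   # fused encoder input

# toy usage
T, S, Da, Dv, Dk = 5, 2, 8, 4, 6
rng = np.random.default_rng(0)
fused = fuse_audio_visual(rng.normal(size=(T, Da)),
                          rng.normal(size=(S, T, Dv)),
                          rng.normal(size=(Da, Dk)),
                          rng.normal(size=(Dv, Dk)))
print(fused.shape)   # (5, 12)
```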
Turn-To-Diarize: Online speaker diarization constrained by transformer transducer speaker turn detection
Han Lu
Wei Xia
Submitted to ICASSP 2022, IEEE (2021)
In this paper, we present a novel speaker diarization system for streaming on-device applications. In this system, we use a transformer transducer to detect speaker turns, represent each speaker turn by a speaker embedding, and then cluster these embeddings with constraints from the detected speaker turns.
Compared with conventional clustering-based diarization systems, our system greatly reduces the computational cost of clustering due to the sparsity of speaker turns. Unlike other supervised speaker diarization systems, which require annotations of timestamped speaker labels, our system only requires speaker turn tokens to be included during transcription, which greatly reduces the human effort involved in data collection.
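A minimal sketch of the constrained-clustering idea, assuming one embedding per detected speaker turn and a cannot-link constraint between consecutive turns. The real system pairs transformer-transducer turn detection with a more sophisticated constrained clustering; the NumPy code below is only an illustrative stand-in.

```python
# Minimal sketch (NumPy only): cluster per-turn speaker embeddings while
# forbidding consecutive turns from sharing a label (a turn boundary implies
# a speaker change). Illustrative stand-in for the system described above.
import numpy as np

def cluster_turns(embeddings, n_speakers, penalty=1e6):
    """embeddings: (N, D) one L2-normalized embedding per detected turn."""
    n = len(embeddings)
    # cosine distance between turn embeddings
    dist = 1.0 - embeddings @ embeddings.T
    # cannot-link: adjacent turns are assumed to belong to different speakers
    for i in range(n - 1):
        dist[i, i + 1] = dist[i + 1, i] = penalty
    # complete-linkage agglomerative clustering
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_speakers:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[i, j] for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = np.empty(n, dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

# toy usage: 4 turns from 2 alternating speakers
emb = np.array([[1, 0], [0, 1], [0.9, 0.1], [0.1, 0.9]], dtype=float)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(cluster_turns(emb, n_speakers=2))   # -> [0 1 0 1]
```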
Multilingual Speech Recognition with Self-Attention Structured Parameterization
Yun Zhu
Brian Farris
Hainan Xu
Han Lu
Pedro Jose Moreno Mengibar
Qian Zhang
Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, ISCA
Multilingual automatic speech recognition systems can transcribe utterances from different languages. These systems are attractive from several perspectives: they can provide quality improvements, especially for lower-resource languages, and they simplify training and deployment. End-to-end speech recognition has further simplified multilingual modeling, since a single model, rather than the several components of a classical system, has to be unified. In this paper, we investigate a streamable end-to-end multilingual system based on the Transformer Transducer. We propose several techniques for adapting the self-attention architecture based on the language id. We analyze the trade-offs of each method with regard to quality gains and the number of additional parameters introduced. We conduct experiments on a real-world task consisting of five languages. Our experimental results demonstrate approximately 10% and 15% relative gains over the baseline multilingual model.
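One way such language-dependent parameterization can look in code: the sketch below adds a per-language low-rank residual to a shared self-attention projection, selected by the language id. This particular factorization and the variable names are assumptions for illustration, not necessarily the adaptation schemes evaluated in the paper.

```python
# Minimal sketch (NumPy) of conditioning a self-attention projection on a
# language id via a shared matrix plus a per-language low-rank residual.
# Illustrative assumption, not the paper's exact formulation.
import numpy as np

def language_adapted_projection(x, lang_id, W_shared, A, B):
    """x: (T, D) encoder frames; lang_id: int index into the language set.
    W_shared: (D, D) shared projection; A: (L, D, r), B: (L, r, D)
    per-language low-rank adapters."""
    W_lang = W_shared + A[lang_id] @ B[lang_id]   # (D, D) adapted weights
    return x @ W_lang

# toy usage with 5 languages and rank-2 adapters
rng = np.random.default_rng(0)
T, D, L, r = 10, 16, 5, 2
x = rng.normal(size=(T, D))
W = rng.normal(size=(D, D)) * 0.1
A = rng.normal(size=(L, D, r)) * 0.1
B = rng.normal(size=(L, r, D)) * 0.1
q = language_adapted_projection(x, lang_id=3, W_shared=W, A=A, B=B)
print(q.shape)   # (10, 16)
```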
Domain Adaptation Using Factorized Hidden Layer for Robust Automatic Speech Recognition
Interspeech (2018), pp. 892-896
Domain robustness is a challenging problem for automatic speech recognition (ASR). In this paper, we consider speech data collected for different applications as separate domains and investigate the robustness, on unseen domains, of acoustic models trained on multi-domain data. Specifically, we use the Factorized Hidden Layer (FHL) as a compact low-rank representation to adapt a multi-domain ASR system to unseen domains. Experimental results on two unseen domains show that FHL is a more effective adaptation method than selectively fine-tuning part of the network, without dramatically increasing the number of model parameters. Furthermore, we found that using singular value decomposition to initialize the low-rank bases of an FHL model leads to faster convergence and improved performance.
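A minimal sketch of the FHL idea as described above: the adapted weight matrix is a shared matrix plus a domain-dependent combination of low-rank bases, and each basis can be initialized from the SVD of a weight difference. Helper names, shapes, and the toy initialization targets are illustrative assumptions.

```python
# Minimal sketch (NumPy) of a Factorized Hidden Layer weight and of
# SVD-based initialization of its low-rank bases. Illustrative only.
import numpy as np

def fhl_weight(W_shared, U, V, d_vec):
    """W_shared: (Din, Dout); U: (K, Din, r), V: (K, r, Dout) low-rank bases;
    d_vec: (K,) domain-dependent interpolation weights."""
    W = W_shared.copy()
    for k in range(len(d_vec)):
        W += d_vec[k] * (U[k] @ V[k])
    return W

def init_basis_from_svd(delta_W, rank):
    """Initialize one low-rank basis from the SVD of a weight delta
    (e.g., domain-adapted minus multi-domain weights)."""
    u, s, vt = np.linalg.svd(delta_W, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]   # (Din, r), (r, Dout)

# toy usage
rng = np.random.default_rng(0)
Din, Dout, K, r = 32, 32, 4, 2
W0 = rng.normal(size=(Din, Dout)) * 0.1
U = np.stack([init_basis_from_svd(rng.normal(size=(Din, Dout)) * 0.01, r)[0]
              for _ in range(K)])
V = np.stack([init_basis_from_svd(rng.normal(size=(Din, Dout)) * 0.01, r)[1]
              for _ in range(K)])
d_vec = rng.normal(size=K)
W_adapted = fhl_weight(W0, U, V, d_vec)
print(W_adapted.shape)   # (32, 32)
```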
Temporal Modeling Using Dilated Convolution and Gating for Voice-Activity-Detection
Gabor Simko
Aäron van den Oord
ICASSP 2018
Voice-activity-detection (VAD) is the task of predicting where in an utterance there is speech versus background noise. It is an important first step in determining when to open the microphone (i.e., start-of-speech) and close the microphone (i.e., end-of-speech) for streaming speech recognition applications such as Voice Search. Long short-term memory neural networks (LSTMs) have been a popular architecture for sequential modeling of acoustic signals and have been used successfully in many VAD applications. However, it has been observed that LSTMs suffer from state saturation when the utterance is long (e.g., in voice dictation tasks), and thus require the LSTM state to be periodically reset. In this paper, we propose an alternative architecture that does not suffer from saturation problems because it models temporal variations through a stateless dilated convolutional neural network (CNN). The proposed architecture differs from conventional CNNs in three respects: (1) dilated causal convolution, (2) gated activations, and (3) residual connections. Results on a Google Voice Typing task show that the proposed architecture achieves a 14% relative FA improvement at an FR of 1% over state-of-the-art LSTMs for the VAD task. We also include detailed experiments investigating the factors that distinguish the proposed architecture from conventional convolution.
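A minimal NumPy sketch of the kind of stateless block described above: a dilated causal convolution followed by a gated (tanh × sigmoid) activation and a residual connection, stacked with increasing dilation. Layer sizes, the kernel width, and the final frame-level classifier are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch (NumPy) of a dilated causal convolution with a gated
# activation and residual connection, applied to acoustic feature frames.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dilated_causal_conv(x, w, dilation):
    """x: (T, Din); w: (kernel, Din, Dout). Causal: the output at frame t
    only sees inputs at t, t - dilation, t - 2*dilation, ..."""
    kernel, _, Dout = w.shape
    pad = (kernel - 1) * dilation
    xp = np.concatenate([np.zeros((pad, x.shape[1])), x], axis=0)
    T = x.shape[0]
    y = np.zeros((T, Dout))
    for k in range(kernel):
        y += xp[k * dilation:k * dilation + T] @ w[k]
    return y

def gated_residual_block(x, w_filter, w_gate, dilation):
    """Gated activation (tanh x sigmoid) with a residual connection."""
    h = np.tanh(dilated_causal_conv(x, w_filter, dilation)) * \
        sigmoid(dilated_causal_conv(x, w_gate, dilation))
    return x + h   # residual connection (assumes matching dimensions)

# toy usage: stack blocks with exponentially increasing dilation
rng = np.random.default_rng(0)
T, D, kernel = 100, 8, 2
x = rng.normal(size=(T, D))
for dilation in (1, 2, 4, 8):
    wf = rng.normal(size=(kernel, D, D)) * 0.1
    wg = rng.normal(size=(kernel, D, D)) * 0.1
    x = gated_residual_block(x, wf, wg, dilation)
speech_prob = sigmoid(x @ rng.normal(size=(D, 1)))   # per-frame VAD score
print(speech_prob.shape)   # (100, 1)
```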
Toward Domain-Invariant Speech Recognition via Large Scale Training
Mohamed (Mo) Elfeky
SLT, IEEE (2018)
Current state-of-the-art automatic speech recognition systems are trained to work in specific ‘domains’, defined based on factors like application, sampling rate and codec. When such recognizers are used in conditions that do not match the training domain, performance drops significantly. In this paper, we explore the idea of building a single domain-invariant model that works well for varied use-cases. We do this by combining large scale training data from multiple application domains. Our final system is trained using 162,000 hours of speech. Additionally, each utterance is artificially distorted during training to simulate effects like background noise, codec distortion, and different sampling rates. Our results show that, even at such a scale, a model thus trained works almost as well as those fine-tuned to specific subsets: a single model can be trained to be robust to multiple application domains and other variations like codecs and noise. Such models also generalize better to unseen conditions and allow for rapid adaptation to new domains: we show that by using as little as 10 hours of data to adapt a domain-invariant model to a new domain, we can match the performance of a domain-specific model trained from scratch using roughly 70 times as much data. We also highlight some of the limitations of such models and areas that need addressing in future work.
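A rough sketch, under stated assumptions, of the kind of on-the-fly distortion mentioned above: mixing in background noise at a random SNR and simulating a lower sampling rate with SciPy resampling. The actual training setup covers more conditions (codecs, reverberation, varied noise sources), and the function names here are hypothetical.

```python
# Minimal sketch (NumPy/SciPy) of multi-condition training distortions:
# additive noise at a random SNR plus a simulated narrowband channel.
import numpy as np
from scipy.signal import resample_poly

def add_noise_at_snr(speech, noise, snr_db, rng):
    """Mix a random slice of `noise` into `speech` at the requested SNR."""
    start = rng.integers(0, len(noise) - len(speech) + 1)
    noise = noise[start:start + len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

def simulate_narrowband(speech, orig_rate=16000, low_rate=8000):
    """Downsample then upsample to mimic an 8 kHz telephony channel."""
    down = resample_poly(speech, low_rate, orig_rate)
    return resample_poly(down, orig_rate, low_rate)

def distort(speech, noise, rng):
    speech = add_noise_at_snr(speech, noise, snr_db=rng.uniform(5, 25), rng=rng)
    if rng.random() < 0.5:
        speech = simulate_narrowband(speech)
    return speech

# toy usage with synthetic signals
rng = np.random.default_rng(0)
utt = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.normal(size=48000)
print(distort(utt, noise, rng).shape)   # (16000,)
```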
Speech recognition for medical conversations
Chung-Cheng Chiu
Kat Chou
Chris Co
Navdeep Jaitly
Diana Jaunzeikare
Patrick Nguyen
Ananth Sankar
Justin Jesada Tansuwan
Nathan Wan
Frank Zhang
Interspeech 2018 (2018)
In this paper we document our experiences with developing speech recognition for Medical Transcription -- a system that automatically transcribes notes from doctor-patient conversations. Towards this goal, we built a system along two different methodological lines -- a Connectionist Temporal Classification (CTC) phoneme-based model and a Listen, Attend and Spell (LAS) model. To train these models we used a corpus of anonymized conversations representing approximately 14,000 hours of speech. Because of noisy transcripts and alignments in the corpus, a significant amount of effort was invested in data cleaning. We describe a two-stage strategy we followed for segmenting the data. The data cleanup and the development of a matched language model were essential to the success of the CTC-based models. The LAS-based models, however, were found to be resilient to alignment and transcript noise and did not require the use of language models. The CTC models achieved a word error rate of 20.1%, and the LAS models achieved 18.5%.
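For reference, a minimal PyTorch sketch of the CTC training criterion used by the first line of models, with random tensors standing in for acoustic-model outputs and phoneme targets. The label inventory size and shapes are illustrative assumptions; the paper's CTC system additionally relies on a matched language model during decoding.

```python
# Minimal sketch (PyTorch) of the CTC criterion over phoneme targets.
import torch
import torch.nn.functional as F

T, N, C, S = 50, 2, 42, 12           # frames, batch, phonemes (+blank), target length
logits = torch.randn(T, N, C)        # stand-in for per-frame acoustic model outputs
log_probs = F.log_softmax(logits, dim=-1)
targets = torch.randint(1, C, (N, S))          # phoneme ids (0 is the blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```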