Hasim Sak
Authored Publications
Turn-To-Diarize: Online speaker diarization constrained by transformer transducer speaker turn detection
Han Lu
Wei Xia
Submitted to ICASSP 2022, IEEE (2021)
Abstract
In this paper, we present a novel speaker diarization system for streaming on-device applications. In this system, we use a transformer transducer to detect the speaker turns, represent each speaker turn by a speaker embedding, then cluster these embeddings with constraints from the detected speaker turns.
Compared with conventional clustering-based diarization systems, our system largely reduces the computational cost of clustering due to the sparsity of speaker turns. Unlike other supervised speaker diarization systems, which require annotations of timestamped speaker labels, our system only requires speaker turn tokens to be included during transcription, which largely reduces the human effort involved in data collection.
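As a rough illustration of the clustering stage, the sketch below assumes that per-frame speaker embeddings and speaker-turn boundaries have already been produced, and mean-pools the frames of each detected turn before clustering the resulting turn embeddings with off-the-shelf hierarchical clustering (scipy); the constrained spectral clustering used in the actual system is replaced here for brevity, and all variable names are illustrative.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def cluster_speaker_turns(frame_embeddings, turn_boundaries, distance_threshold=0.5):
        """Pool frame-level speaker embeddings into one vector per detected
        speaker turn, then cluster the (few) turn embeddings."""
        # frame_embeddings: (num_frames, dim) d-vectors.
        # turn_boundaries: frame indices where a speaker-turn token was emitted.
        bounds = list(turn_boundaries) + [len(frame_embeddings)]
        turn_embeddings = np.stack([
            frame_embeddings[s:e].mean(axis=0)
            for s, e in zip(bounds[:-1], bounds[1:])
        ])
        # Speaker turns are sparse, so clustering runs on far fewer points
        # than frame-level diarization would require.
        tree = linkage(turn_embeddings, method="average", metric="cosine")
        return fcluster(tree, t=distance_threshold, criterion="distance")

Pooling within a turn already enforces that frames inside one detected turn share a speaker; the paper additionally uses the detected turns as constraints during clustering.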
Multilingual Speech Recognition with Self-Attention Structured Parameterization
Yun Zhu
Brian Farris
Hainan Xu
Han Lu
Pedro Jose Moreno Mengibar
Qian Zhang
Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, ISCA
Abstract
Multilingual automatic speech recognition systems can transcribe utterances from different languages. These systems are attractive from different perspectives: they can provide quality improvements, especially for lower-resource languages, and simplify the training and deployment procedure. End-to-end speech recognition has further simplified multilingual modeling, as a single model, instead of several components of a classical system, has to be unified. In this paper, we investigate a streamable end-to-end multilingual system based on the Transformer Transducer. We propose several techniques for adapting the self-attention architecture based on the language id. We analyze the trade-offs of each method with regard to quality gains and the number of additional parameters introduced. We conduct experiments on a real-world task consisting of five languages. Our experimental results demonstrate approximately 10% and 15% relative gains over the baseline multilingual model.
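One simple way to condition self-attention on the language id, sketched below in PyTorch, is to keep separate query and key projections per language while sharing the remaining parameters; the paper evaluates several such parameterizations, and this module (names such as d_model and num_languages are assumptions) is only meant to illustrate the general idea.

    import math
    import torch
    import torch.nn as nn

    class LanguageAdaptedSelfAttention(nn.Module):
        """Self-attention with language-dependent query/key projections and
        shared value/output projections (one possible adaptation scheme)."""

        def __init__(self, d_model, num_languages):
            super().__init__()
            # Per-language parameters: only the query and key projections.
            self.q_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_languages)])
            self.k_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_languages)])
            # Shared parameters.
            self.v_proj = nn.Linear(d_model, d_model)
            self.out_proj = nn.Linear(d_model, d_model)
            self.scale = 1.0 / math.sqrt(d_model)

        def forward(self, x, lang_id):
            # x: (batch, time, d_model); lang_id: integer index of the language.
            q = self.q_proj[lang_id](x)
            k = self.k_proj[lang_id](x)
            v = self.v_proj(x)
            attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
            return self.out_proj(attn @ v)

Only the adapted projections add parameters per language, which is the kind of quality-versus-size trade-off the abstract refers to.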
Abstract
Multilingual training has proven to improve acoustic modeling performance by sharing and transferring knowledge across languages. Knowledge sharing is usually achieved by using common lower-level layers for different languages in a deep neural network. Recently, the domain adversarial network was proposed to reduce domain mismatch of training data and learn domain-invariant features. It is thus worth exploring whether adversarial training can further promote knowledge sharing in multilingual models. In this work, we apply the domain adversarial network to encourage the shared layers of a multilingual model to learn language-invariant features. Bidirectional Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) are used as building blocks. We show that shared layers learned this way contain less language identification information and lead to better acoustic modeling performance. In an automatic speech recognition task for seven languages, the resultant acoustic model improves the word error rate (WER) of the multilingual model by a relative 4% on average, and that of the monolingual models by 10%.
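A compact sketch of the adversarial component, assuming PyTorch: a gradient-reversal layer sits between the shared bidirectional LSTM layers and an auxiliary language classifier, so that the shared features are trained to be uninformative about the language while still serving the acoustic modeling task. Shapes, layer counts, and the reversal weight are illustrative assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        """Identity in the forward pass; flips and scales the gradient in backward."""
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.clone()

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    class SharedEncoderWithLanguageAdversary(nn.Module):
        def __init__(self, feat_dim, hidden_dim, num_languages, lambd=0.1):
            super().__init__()
            self.lambd = lambd
            # Shared lower layers: bidirectional LSTMs over the acoustic features.
            self.shared = nn.LSTM(feat_dim, hidden_dim, num_layers=2,
                                  bidirectional=True, batch_first=True)
            # Adversarial head that tries to identify the language.
            self.lang_classifier = nn.Linear(2 * hidden_dim, num_languages)

        def forward(self, feats):
            shared_out, _ = self.shared(feats)               # (batch, time, 2*hidden)
            reversed_out = GradReverse.apply(shared_out, self.lambd)
            lang_logits = self.lang_classifier(reversed_out.mean(dim=1))
            # shared_out feeds the language-specific output layers; the
            # language-ID cross-entropy on lang_logits, after gradient reversal,
            # pushes shared_out toward language invariance.
            return shared_out, lang_logits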
Speech recognition for medical conversations
Chung-Cheng Chiu
Kat Chou
Chris Co
Navdeep Jaitly
Diana Jaunzeikare
Patrick Nguyen
Ananth Sankar
Justin Jesada Tansuwan
Nathan Wan
Frank Zhang
Interspeech 2018 (2018)
Abstract
In this paper we document our experiences with developing speech recognition for Medical Transcription, a system that automatically transcribes notes from doctor-patient conversations. Towards this goal, we built a system along two different methodological lines: a Connectionist Temporal Classification (CTC) phoneme-based model and a Listen Attend and Spell (LAS) model. To train these models we used a corpus of anonymized conversations representing approximately 14,000 hours of speech. Because of noisy transcripts and alignments in the corpus, a significant amount of effort was invested in data cleaning. We describe a two-stage strategy we followed for segmenting the data. The data cleanup and the development of a matched language model were essential to the success of the CTC-based models. The LAS-based models, however, were found to be resilient to alignment and transcript noise and did not require the use of language models. The CTC models achieved a word error rate of 20.1%, and the LAS models achieved 18.5%.
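For the CTC line of models the training objective is the standard CTC loss over phoneme targets; a minimal PyTorch illustration is given below, with toy dimensions and a made-up phoneme inventory, and without the data cleanup, segmentation, and language-model components that the paper describes.

    import torch
    import torch.nn as nn

    num_phonemes = 42                     # toy inventory, index 0 reserved for blank
    batch, max_frames, feat_dim = 4, 300, 80

    acoustic_model = nn.LSTM(feat_dim, 256, num_layers=3, batch_first=True)
    output_layer = nn.Linear(256, num_phonemes)
    ctc_loss = nn.CTCLoss(blank=0)

    feats = torch.randn(batch, max_frames, feat_dim)
    input_lengths = torch.tensor([300, 280, 250, 220])
    targets = torch.randint(1, num_phonemes, (batch, 60))   # phoneme IDs, no blanks
    target_lengths = torch.tensor([60, 55, 48, 40])

    hidden, _ = acoustic_model(feats)
    log_probs = output_layer(hidden).log_softmax(dim=-1)    # (batch, time, phonemes)
    # nn.CTCLoss expects (time, batch, classes).
    loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
    loss.backward()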
Abstract
We explore the viability of grapheme-based speech recognition, specifically how it compares to phoneme-based equivalents. We use the CTC loss to train models that directly predict graphemes, and we also train models with hierarchical CTC and show that they improve on previous CTC models. We further explore how the grapheme and phoneme models scale with large data sets, considering a single acoustic training set that combines various dialects of English from the US, UK, India, and Australia. We show that by training a single grapheme-based model on this multi-dialect data set we create an accent-robust ASR system.
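The hierarchical-CTC variant mentioned above can be approximated, under assumptions about layer placement and loss weighting that the abstract does not spell out, by attaching an auxiliary CTC loss over a second target sequence (for example phonemes) at an intermediate encoder layer while the top of the encoder is trained with a grapheme CTC loss:

    import torch
    import torch.nn as nn

    class HierarchicalCTCModel(nn.Module):
        """Grapheme CTC at the top of the encoder plus an auxiliary CTC loss at
        an intermediate layer (one common realization of hierarchical CTC)."""

        def __init__(self, feat_dim, hidden, num_graphemes, num_aux_units):
            super().__init__()
            self.lower = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
            self.upper = nn.LSTM(hidden, hidden, num_layers=3, batch_first=True)
            self.aux_out = nn.Linear(hidden, num_aux_units)      # e.g. phonemes
            self.grapheme_out = nn.Linear(hidden, num_graphemes)
            self.ctc = nn.CTCLoss(blank=0)

        def forward(self, feats, in_lens, graphemes, g_lens, aux_targets, a_lens):
            low, _ = self.lower(feats)
            high, _ = self.upper(low)
            aux_logp = self.aux_out(low).log_softmax(-1).transpose(0, 1)
            gra_logp = self.grapheme_out(high).log_softmax(-1).transpose(0, 1)
            aux_loss = self.ctc(aux_logp, aux_targets, in_lens, a_lens)
            gra_loss = self.ctc(gra_logp, graphemes, in_lens, g_lens)
            # The auxiliary loss weight is a tunable assumption.
            return gra_loss + 0.3 * aux_loss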
Acoustic Modeling for Google Home
Joe Caroselli
Kean Chin
Chanwoo Kim
Mitchel Weintraub
Erik McDermott
INTERSPEECH 2017 (2017)
Abstract
This paper describes the technical and system building advances made to the Google Home multichannel speech recognition system, which was launched in November 2016. Technical advances include an adaptive dereverberation frontend, the use of neural network models that do multichannel processing jointly with acoustic modeling, and grid LSTMs to model frequency variations. On the system level, improvements include adapting the model using Google Home specific data. We present results on a variety of multichannel sets. The combination of technical and system advances results in a reduction of WER of over 18% relative compared to the current production system.
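The sketch below is only a crude stand-in for the joint multichannel processing described above: a trainable filter-and-sum layer over the raw channels, implemented as a grouped 1-D convolution, whose output would feed a standard acoustic model. The production system uses considerably more sophisticated factored multichannel layers and grid LSTMs; the channel count and filter length here are assumptions.

    import torch
    import torch.nn as nn

    class FilterAndSumFrontend(nn.Module):
        """Learn one FIR filter per microphone channel and sum the filtered
        channels, i.e. a trainable filter-and-sum beamformer."""

        def __init__(self, num_channels=2, filter_len=81):
            super().__init__()
            # groups=num_channels gives an independent filter per channel.
            self.filters = nn.Conv1d(num_channels, num_channels,
                                     kernel_size=filter_len,
                                     padding=filter_len // 2,
                                     groups=num_channels, bias=False)

        def forward(self, waveforms):
            # waveforms: (batch, channels, samples) multichannel audio.
            filtered = self.filters(waveforms)
            return filtered.sum(dim=1)       # (batch, samples) enhanced signal

    frontend = FilterAndSumFrontend()
    enhanced = frontend(torch.randn(8, 2, 16000))   # one second of 16 kHz audio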
Abstract
We investigate training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T): a streaming, all-neural, sequence-to-sequence architecture which jointly learns acoustic and language model components from transcribed acoustic data. We demonstrate how the model can be improved further if additional text or pronunciation data are available. The model consists of an 'encoder', which is initialized from a connectionist temporal classification-based (CTC) acoustic model, and a 'decoder', which is partially initialized from a recurrent neural network language model trained on text data alone. The entire neural network is trained with the RNN-T loss and directly outputs the recognized transcript as a sequence of graphemes, thus performing end-to-end speech recognition. We find that performance can be improved further through the use of sub-word units ('wordpieces') which capture longer context and significantly reduce substitution errors. The best RNN-T system, a twelve-layer LSTM encoder with a two-layer LSTM decoder trained with 30,000 wordpieces as output targets, is comparable in performance to a state-of-the-art baseline on dictation and voice-search tasks.
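A structural sketch of the RNN-T setup described above, assuming PyTorch: an LSTM encoder, an LSTM prediction network (the 'decoder'), and a joint network, plus a hypothetical helper showing the kind of partial initialization meant by warm-starting from a CTC acoustic model and an RNN language model. Layer sizes, names, and checkpoint layouts are illustrative, and the RNN-T loss itself is omitted.

    import torch
    import torch.nn as nn

    class RNNTransducer(nn.Module):
        def __init__(self, feat_dim, vocab_size, enc_dim=640, pred_dim=640, joint_dim=640):
            super().__init__()
            self.encoder = nn.LSTM(feat_dim, enc_dim, num_layers=12, batch_first=True)
            self.embed = nn.Embedding(vocab_size, pred_dim)
            self.prediction = nn.LSTM(pred_dim, pred_dim, num_layers=2, batch_first=True)
            self.joint = nn.Sequential(
                nn.Linear(enc_dim + pred_dim, joint_dim), nn.Tanh(),
                nn.Linear(joint_dim, vocab_size + 1))        # +1 for the blank label

        def forward(self, feats, labels):
            enc, _ = self.encoder(feats)                     # (B, T, enc_dim)
            pred, _ = self.prediction(self.embed(labels))    # (B, U, pred_dim)
            # Combine every encoder frame with every prediction step.
            joint_in = torch.cat([
                enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1),
                pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)], dim=-1)
            return self.joint(joint_in)                      # (B, T, U, vocab+1) logits for the RNN-T loss

    def warm_start(model, ctc_encoder_state, lm_state):
        # Partial initialization: encoder weights from a CTC acoustic model,
        # prediction network weights from an RNN language model
        # (hypothetical checkpoint layouts).
        model.encoder.load_state_dict(ctc_encoder_state, strict=False)
        model.prediction.load_state_dict(lm_state, strict=False)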
Abstract
We present a new procedure to train acoustic models from scratch for large vocabulary speech recognition, requiring no previous model for alignments or bootstrapping. We augment the Connectionist Temporal Classification (CTC) objective function to allow training of acoustic models directly from a parallel corpus of audio data and transcribed data. With this augmented CTC function we train a phoneme recognition acoustic model directly from the written-domain transcript. Further, we outline a mechanism to generate context-dependent phonemes from a CTC model trained to predict phonemes, and ultimately train a second CTC model to predict these context-dependent phonemes. Since this approach does not require training of any previous non-CTC model, it drastically reduces the overall data-to-model training time from 30 days to 10 days. Additionally, models obtained from this flatstart-CTC procedure outperform the state-of-the-art by XX-XX%.
Abstract
We present results that show it is possible to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. We model the output vocabulary of about 100,000 words directly using deep bi-directional LSTM RNNs with the CTC loss. The model is trained on 125,000 hours of semi-supervised acoustic training data, which enables us to alleviate the data sparsity problem for word models. We show that the CTC word models work very well as an end-to-end all-neural speech recognition model without the use of traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model, removing the need to decode. We demonstrate that the CTC word models perform better than a strong, more complex, state-of-the-art baseline with sub-word units.
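Because the output units are whole words and neither a lexicon nor a language model is involved, recognition reduces to a greedy frame-wise argmax over the CTC posteriors followed by collapsing repeats and dropping blanks; a small sketch with a toy vocabulary (names and indices assumed) is shown below.

    import numpy as np

    def greedy_ctc_words(posteriors, id_to_word, blank_id=0):
        """Greedy CTC decoding for a word-level model: no lexicon, no LM, no beam.
        posteriors: (num_frames, vocab_size) per-frame CTC posteriors."""
        best_ids = posteriors.argmax(axis=1)
        words, prev = [], blank_id
        for idx in best_ids:
            # Collapse repeated labels, then remove blanks.
            if idx != prev and idx != blank_id:
                words.append(id_to_word[idx])
            prev = idx
        return " ".join(words)

    # Toy 4-word vocabulary; index 0 is the CTC blank.
    vocab = {0: "<blank>", 1: "play", 2: "some", 3: "music"}
    frames = np.eye(4)[[0, 1, 1, 0, 2, 0, 3, 3]]     # fake one-hot posteriors
    print(greedy_ctc_words(frames, vocab))           # -> "play some music"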
Personalized Speech Recognition On Mobile Devices
Raziel Alvarez
David Rybach
Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2016)
Abstract
We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time.
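The SVD-based compression mentioned above can be illustrated in a few lines of numpy: a large weight matrix W is replaced by the product of two thin factors obtained from a truncated SVD, trading a small approximation error for a large reduction in parameters. The matrix size and rank below are arbitrary examples, not the production settings.

    import numpy as np

    def svd_compress(W, rank):
        """Factor W (m x n) into A (m x rank) and B (rank x n) with A @ B ~ W."""
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        A = U[:, :rank] * S[:rank]      # fold the singular values into the left factor
        B = Vt[:rank, :]
        return A, B

    # Example: a 1024 x 2048 weight matrix compressed to rank 128.
    W = np.random.randn(1024, 2048).astype(np.float32)
    A, B = svd_compress(W, rank=128)
    print((A.size + B.size) / W.size)                      # ~0.19 of the original parameters
    print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))   # relative approximation error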