Khe Chai Sim

Authored Publications
Knowledge distillation is an effective machine learning technique for transferring knowledge from a teacher model to a student model. It is also a crucial component for learning from unlabeled data, for example, in Noisy Student Training. In this paper, we focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) ASR. Specifically, we compared using soft and hard distillation targets to train large-scale RNN-T models on the LibriSpeech public dataset (60k hours) and our in-house data (600k hours). We found that hard targets are more effective when distilling from a larger teacher model to a smaller streaming student model. On the other hand, soft-target distillation works better when the teacher and student models have a similar network architecture. For a large model with 600M parameters, we can achieve a new SoTA word error rate (WER) on LibriSpeech (8% relative improvement on dev-other) using Noisy Student Training with soft targets.
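To make the distinction between the two target types concrete, here is a minimal sketch of soft vs. hard distillation losses, assuming per-frame token posteriors of shape (T, V) from the teacher and student. Real RNN-T distillation operates on the transducer output lattice, so this framewise view is a simplification, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    # Cross-entropy against the teacher's softened posterior distribution.
    p_teacher = softmax(teacher_logits / temperature)
    log_p_student = np.log(softmax(student_logits / temperature) + 1e-12)
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

def hard_target_loss(student_logits, teacher_logits):
    # Cross-entropy against the teacher's one-best (argmax) labels.
    hard_labels = teacher_logits.argmax(axis=-1)
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    return -log_p_student[np.arange(len(hard_labels)), hard_labels].mean()
```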
Almost none of the 2,000+ languages spoken in Africa have widely available automatic speech recognition systems, and the required data is also only available for a few languages. We have experimented with two techniques which may provide pathways to large vocabulary speech recognition for African languages: multilingual modeling and self-supervised learning. We gathered available open source data and collected data for 15 languages, and trained experimental models using these techniques. Our results show that pooling the small amounts of data available in multilingual end-to-end models, and pre-training on unsupervised data can help improve speech recognition quality for many African languages.
Joint Unsupervised and Supervised Training for Multilingual ASR
Yu Zhang
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2022), pp. 6402-6406
Self-supervised training has been showing promising gains in pretraining models and facilitating downstream finetuning for speech recognition. Effective self-supervised losses designed for large-scale unlabeled data can help learn useful latent structures. Most existing methods adopt a 2-stage scheme where the self-supervised loss is optimized in the first pretraining stage, and the standard supervised finetuning resumes in the second stage. However, the pretrained checkpoint selection is known to be tricky and tedious, and pure finetuning can cause catastrophic forgetting of the learnt representations. To address these concerns, we propose an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method to combine the supervised RNN-T loss and the self-supervised contrastive and masked language modeling (MLM) losses. We apply our method to a challenging multilingual automatic speech recognition (ASR) task and validate its performance on the public Multilingual LibriSpeech (MLS) dataset, which includes 8 languages and is extremely imbalanced. On MLS, we explore (1) JUST trained from scratch, and (2) JUST finetuned from a pretrained checkpoint. Experiments show that JUST can consistently outperform other existing state-of-the-art (SOTA) methods by 10%, and beat the monolingual baseline by a significant margin, demonstrating JUST's capability of handling low-resource languages in multilingual ASR. Our average WER across all languages outperforms the monolingual baselines by 33.3%, and the state-of-the-art 2-stage XLSR by 32%. On a low-resource language like Polish, our WER is less than half of the monolingual baseline and even beats the supervised transfer learning method using external supervision.
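As a rough illustration of the joint objective described above, the sketch below combines the supervised RNN-T loss with the self-supervised contrastive and MLM losses in a single end-to-end training step. The model interface, the individual loss functions, and the weights w_c and w_m are hypothetical placeholders, not the paper's actual components or values.

```python
def just_loss(batch, model, w_c=1.0, w_m=1.0):
    """One joint training objective; all branches share the speech encoder."""
    features, labels = batch
    encoded = model.encode(features)                    # shared encoder output
    l_rnnt = rnnt_loss(model.decode(encoded), labels)   # supervised RNN-T branch
    l_contrastive = contrastive_loss(encoded)           # self-supervised contrastive branch
    l_mlm = mlm_loss(encoded)                           # self-supervised masked-LM branch
    return l_rnnt + w_c * l_contrastive + w_m * l_mlm
```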
Human labeling is expensive, and labeling is the most painful step in ML production. It is widely believed that data is the new gold and that big tech companies have an unfair advantage. Is it true that unlimited data means unlimited model performance? In this study, we show that 1k hours of human-labeled data is enough to train the best ASR model. The model trained with 1k hours of human labels and 26k hours of pseudo labels has better WERs than the model trained with 27k hours of human labels. Pseudo-label training improves the WER of the production model by a significant margin: from 5.9 to 5.1 on voice search. This suggests that the pseudo-label quality is better than that of human labels. To obtain quality pseudo labels, we utilized recent self- and semi-supervised learning for a large ASR model.
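For concreteness, here is a minimal sketch of the pseudo-labeling recipe the abstract alludes to: a teacher trained on the small human-labeled set transcribes a large unlabeled pool, and the final model is trained on the combined data. The helper functions train_asr and transcribe are hypothetical placeholders, not the actual training pipeline.

```python
def noisy_student_round(human_labeled, unlabeled_audio):
    teacher = train_asr(human_labeled)                       # e.g. 1k hrs of human labels
    pseudo_labeled = [(x, transcribe(teacher, x)) for x in unlabeled_audio]
    student = train_asr(human_labeled + pseudo_labeled)      # 1k hrs human + 26k hrs pseudo
    return student
```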
Self- and semi-supervised learning methods have been actively investigated to reduce the amount of labeled training data or to enhance model performance. However, these approaches mostly focus on in-domain performance on public datasets. In this study, we utilize a combination of self- and semi-supervised learning methods to solve an unseen-domain adaptation problem in a large-scale production setting for an online ASR model. This approach demonstrates that using the source-domain data with a small fraction of the target-domain data (3%) can recover the performance gap compared to a full-data baseline: a relative 13.5% WER improvement on target-domain data.
Personalization of speech models on mobile devices (on-device personalization) is an active area of research, but more often than not, mobile devices have more text-only data than paired audio-text data. We explore training a personalized language model on text-only data, used during inference to improve speech recognition performance for that user. We experiment on a user-clustered LibriSpeech corpus, supplemented with personalized text-only data for each user from Project Gutenberg. We release this User-Specific LibriSpeech (UserLibri) dataset to aid future personalization research. LibriSpeech audio-transcript pairs are grouped into 55 users from the test-clean dataset and 52 users from test-other. We are able to lower the average word error rate per user across both sets in streaming and nonstreaming models, including an improvement of 2.5 for the harder set of test-other users when streaming.
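One common way to use a text-only personal LM at inference time is shallow fusion, where each hypothesis's ASR score is interpolated with the personal LM score. The sketch below assumes that setup; the fusion weight lam and the personal_lm.log_prob interface are illustrative assumptions rather than the paper's exact recipe.

```python
def rescore_with_personal_lm(hypotheses, asr_scores, personal_lm, lam=0.3):
    # Interpolate ASR and personal-LM log-probabilities, then pick the best hypothesis.
    fused = [score + lam * personal_lm.log_prob(text)
             for text, score in zip(hypotheses, asr_scores)]
    best = max(range(len(hypotheses)), key=lambda i: fused[i])
    return hypotheses[best]
```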
End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.
Speaker-independent speech recognition systems trained with data from many users are generally robust against speaker variability and work well for many unseen speakers. However, they still do not generalize well to users with very different speech characteristics. This issue can be addressed by building a personalized system that works well for each specific user. In this paper, we investigate securely training personalized end-to-end speech recognition models on mobile devices so that user data and models are kept on mobile devices without communicating with a server. We study how the mobile training environment impacts performance by simulating on-device data consumption. We conduct experiments using data collected from speech-impaired users for personalization. Our results show that personalization achieved 63.7% relative word error rate reduction when trained in a server environment and 58.1% in a mobile environment. Moving to on-device personalization resulted in 18.7% performance degradation, in exchange for improved scalability and data privacy. To train the model on device, we split the gradient computation into two and achieved a 45% memory reduction at the expense of a 42% increase in training time.
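The memory/time trade-off mentioned at the end can be pictured as splitting the backward pass in two and recomputing the first half's activations, in the spirit of gradient checkpointing. This is only one plausible reading of the split, and the forward/backward helpers below are hypothetical placeholders.

```python
def train_step(x, y):
    # Forward through the first half, keeping only the boundary activation h
    # and discarding the first half's intermediate activations.
    h = forward_half1(x)
    # Piece 1: forward + backward through the second half, returning the
    # gradient w.r.t. h and the second half's parameter gradients.
    grad_h, grads_half2 = forward_backward_half2(h, y)
    # Piece 2: re-run the first half to rebuild its activations, then
    # back-propagate grad_h through it.
    grads_half1 = forward_backward_half1(x, grad_h)
    return grads_half1, grads_half2
```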
Current state-of-the-art automatic speech recognition systems are trained to work in specific 'domains', defined based on factors like application, sampling rate and codec. When such recognizers are used in conditions that do not match the training domain, performance significantly drops. In this paper, we explore the idea of building a single domain-invariant model that works well for varied use-cases. We do this by combining large scale training data from multiple application domains. Our final system is trained using 162,000 hours of speech. Additionally, each utterance is artificially distorted during training to simulate effects like background noise, codec distortion, and sampling rates. Our results show that, even at such a scale, a model thus trained works almost as well as those fine-tuned to specific subsets: A single model can be trained to be robust to multiple application domains, and other variations like codecs and noise. Such models also generalize better to unseen conditions and allow for rapid adaptation to new domains – we show that by using as little as 10 hours of data for adapting a domain-invariant model to a new domain, we can match performance of a domain-specific model trained from scratch using roughly 70 times as much data. We also highlight some of the limitations of such models and areas that need addressing in future work.
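A minimal sketch of the multi-condition distortion described above follows: each training utterance is randomly perturbed to simulate background noise, codec effects, and sampling-rate variation. The distortion helpers, probabilities, SNR range, and codec list are illustrative assumptions, not the paper's configuration.

```python
import random

def distort(waveform, sample_rate):
    # Randomly simulate acoustic and channel variation on one utterance.
    if random.random() < 0.5:
        waveform = add_background_noise(waveform, snr_db=random.uniform(5, 25))
    if random.random() < 0.5:
        target_rate = random.choice([8000, 16000])
        waveform, sample_rate = resample(waveform, sample_rate, target_rate)
    if random.random() < 0.5:
        waveform = apply_codec(waveform, codec=random.choice(["amr", "opus", "mp3"]))
    return waveform, sample_rate
```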
Domain robustness is a challenging problem for automatic speech recognition (ASR). In this paper, we consider speech data collected for different applications as separate domains and investigate the robustness of acoustic models trained on multi-domain data on unseen domains. Specifically, we use Factorized Hidden Layer (FHL) as a compact low-rank representation to adapt a multi-domain ASR system to unseen domains. Experimental results on two unseen domains show that FHL is a more effective adaptation method compared to selectively fine-tuning part of the network, without dramatically increasing the model parameters. Furthermore, we found that using singular value decomposition to initialize the low-rank bases of an FHL model leads to a faster convergence and improved performance.
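To illustrate the kind of adaptation described above, here is a minimal sketch of a factorized-hidden-layer-style adapter: a hidden layer's weight matrix is offset by a low-rank term whose bases are initialized from an SVD, and only a small per-domain scaling vector is adapted. The rank r and the choice of matrix to decompose are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def svd_low_rank_init(W, r):
    # Initialize low-rank bases from the top-r singular vectors of W.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, :r] * s[:r]      # left basis, scaled by singular values
    A = Vt[:r, :]             # right basis
    return B, A

def adapted_hidden_layer(x, W, B, A, domain_scales):
    # Per-domain adaptation: W is offset by a low-rank term weighted by a
    # small domain-dependent scaling vector (the only adapted parameters).
    W_adapted = W + B @ np.diag(domain_scales) @ A
    return x @ W_adapted.T
```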