Jump to Content

Kartik Audhkhasi

Kartik Audhkhasi is a Research Scientist in the Speech Group at Google. His research interests include automatic speech recognition, neural networks, and machine learning. Kartik received his Ph.D. in Electrical Engineering from University of Southern California, USA in 2014. He was a Research Staff Member in the Speech Group at IBM Research AI from 2014 to 2020.

Research Areas

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Preview abstract Hard and soft distillation are two popular approaches for knowledge distillation from a teacher to student ASR model. Despite soft distillation being better than hard distillation, it has several limitations. First, training convergence depends on the match between the teacher and student alignments. Second, soft distillation suffers quality regressions when using teacher and student models with different architectures. Third, in case of non-causal teacher models, soft distillation requires tuning of the shift in teacher alignments to the right. Finally, soft distillation requires both the teacher and student models to have the same temporal sampling rates. In this work, we propose a novel knowledge distillation method for RNN-T models that tackles limitations of both hard and soft distillation approaches. We call our method Full-sum distillation, which simply distills the sequence posterior probability of the teacher model to the student model. Thus, this method does not depend directly on the noisy labels to distill knowledge as well as it does not depend on time dimension. We also propose a variant of Full-sum distillation to distill the sequence discriminative knowledge of the teacher model to the student model to further improve performance. Using full-sum distillation, we achieve significant improvements when training with strong and weak teacher models on public data as well as on in-house production data. View details
    Regularizing Word Segmentation by Creating Misspellings
    Hainan Xu
    Jesse Emond
    Yinghui Huang
    Interspeech 2021 (2021) (to appear)
    Preview abstract This work focuses on improving subword segmentation algorithms for end-to-end speech recognition models, and makes two major contributions. Firstly, we propose a novel word segmentation algorithm. The algorithm uses the same vocabulary file generated by a regular wordpiece model, is easily extensible and supports a variety of regularization techniques in the segmentation space, and outperforms the regular wordpiece model. Secondly, we propose a number of novel regularization methods that introduces randomness into the tokenization algorithm, which bring further gains in speech recognition performance. A noteworthy discovery from this work is that creating artificial misspelling in words results in the best performance among all the methods, which could inspire future research for strategies in this area. View details
    Preview abstract Regularization and data augmentation are crucial to training end-to-end automatic speech recognition systems. Dropout is a popular regularization technique, which operates on each neuron independently by multiplying it with a Bernoulli random variable. We propose a generalization of dropout, called ``convolutional dropout'', where each neuron's activation is replaced with a randomly-weighted linear combination of neuron values in its neighborhood. We believe that this formulation combines the regularizing effect of dropout with the smoothing effects of the convolution operation. In addition to convolutional dropout, this paper also proposes using random wordpiece segmentations as a data augmentation scheme during training, inspired by results in neural machine translation. We adopt both these methods during the training of transformer-transducer speech recognition models, and show consistent improvements over strong baselines across different languages. View details
    Preview abstract Streaming automatic speech recognition (ASR) hypothesizes words as soon as the input audio arrives, whereas non-streaming ASR can potentially wait for the completion of the entire utterance to hypothesize words. Streaming and non-streaming ASR systems have typically used different acoustic encoders. Recent work has attempted to unify them by either jointly training a fixed stack of streaming and non-streaming layers or using knowledge distillation during training to ensure consistency between the streaming and non-streaming predictions. We propose mixture model (MiMo) attention as a simpler and theoretically-motivated alternative that replaces only the attention mechanism, requires no change to the training loss, and allows greater flexibility of switching between streaming and non-streaming mode during inference. Our experiments on the public Librispeech data set and a few Indic language data sets show that MiMo attention endows a single ASR model with the ability to operate in both streaming and non-streaming modes without any overhead and without significant loss in accuracy compared to separately-trained streaming and non-streaming models. View details
    No Results Found