Full-Sum Distillation: A Robust Knowledge Distillation Method for RNN-T Models With Noisy Training Labels
Abstract
Hard and soft distillation are two popular approaches for knowledge distillation from a teacher to a student ASR model. Although soft distillation generally outperforms hard distillation, it has several limitations. First, training convergence depends on the match between the teacher and student alignments. Second, soft distillation suffers quality regressions when the teacher and student models have different architectures. Third, in the case of non-causal teacher models, soft distillation requires tuning the rightward shift of the teacher alignments. Finally, soft distillation requires the teacher and student models to have the same temporal sampling rates. In this work, we propose a novel knowledge distillation method for RNN-T models that tackles the limitations of both hard and soft distillation. We call our method Full-sum distillation: it simply distills the sequence posterior probability of the teacher model to the student model. Thus, the method does not depend directly on the noisy labels to distill knowledge, nor does it depend on the time dimension. We also propose a variant of Full-sum distillation that distills the sequence discriminative knowledge of the teacher model to the student model to further improve performance. Using Full-sum distillation, we achieve significant improvements when training with strong and weak teacher models on public data as well as on in-house production data.
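For context, the sequence posterior that Full-sum distillation operates on is the standard RNN-T full-sum quantity: the probability of the label sequence marginalized over all alignments. The sketch below recalls this definition and, as an illustrative assumption (the abstract does not state the exact divergence used), writes the distillation term as a discrepancy between the teacher and student sequence-level log-posteriors; the symbols P_T, P_S, d, and \mathcal{L}_{\text{FS}} are our notation, not taken from the paper.

% Standard RNN-T full-sum: the sequence posterior marginalizes over all
% monotonic alignments a that collapse (via the map B) to the label sequence y.
\begin{align}
  P(y \mid x) &= \sum_{a \in \mathcal{B}^{-1}(y)} P(a \mid x) \\
% Illustrative sketch only (our notation): distill the teacher posterior P_T
% into the student posterior P_S by penalizing the gap between their
% sequence-level log-probabilities with some divergence d.
  \mathcal{L}_{\text{FS}}(x, y) &= d\big(\log P_T(y \mid x),\, \log P_S(y \mid x)\big),
  \qquad \text{e.g. } d(u, v) = (u - v)^2
\end{align}

Because both terms are full sums over alignments, such a loss depends neither on a frame-level teacher alignment nor on matching temporal sampling rates, which is consistent with the motivation stated above.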