Full-Sum Distillation: A Robust Knowledge Distillation Method for RNN-T Models With Noisy Training Labels

Bhuvana Ramabhadran; Kartik Audhkhasi; Mohammad Zeineldeen; Murali Karthick Baskar

ICASSP 2023 (2023)

Abstract

Hard and soft distillation are two popular approaches for knowledge distillation from a teacher to a student ASR model. Although soft distillation outperforms hard distillation, it has several limitations. First, training convergence depends on the match between the teacher and student alignments. Second, soft distillation suffers quality regressions when the teacher and student models have different architectures. Third, for non-causal teacher models, soft distillation requires tuning the rightward shift of the teacher alignments. Finally, soft distillation requires the teacher and student models to have the same temporal sampling rate. In this work, we propose a novel knowledge distillation method for RNN-T models that tackles the limitations of both hard and soft distillation. We call our method full-sum distillation; it simply distills the sequence posterior probability of the teacher model to the student model. Thus, the method neither depends directly on the noisy labels to distill knowledge nor depends on the time dimension. We also propose a variant of full-sum distillation that distills the sequence-discriminative knowledge of the teacher model to the student model to further improve performance. Using full-sum distillation, we achieve significant improvements when training with strong and weak teacher models on public data as well as on in-house production data.
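
To illustrate what distilling a sequence posterior could look like, the following is a minimal sketch and not the authors' implementation. It computes the RNN-T full-sum log-probability log P(y | x) with the standard forward recursion over the time-by-label lattice, then compares the teacher and student values. The function names, the input layout (per-node log-probabilities of blank and of the next label), and the squared-difference distance are all assumptions made for illustration; the abstract does not state which distance is used.

```python
import math

NEG_INF = float("-inf")


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b)) for Python floats."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_full_sum_log_prob(log_blank, log_label):
    """Sum over all RNN-T alignments of one label sequence (log domain).

    log_blank[t][u]: log-prob of emitting blank at frame t after u labels,
                     shape T x (U + 1).
    log_label[t][u]: log-prob of emitting label u + 1 at frame t after u
                     labels, shape T x U.
    (Hypothetical input layout, assumed for this sketch.)
    """
    T = len(log_blank)
    U = len(log_blank[0]) - 1
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0  # log(1): start of the lattice
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Reach (t, u) by a blank from (t - 1, u) ...
            from_blank = alpha[t - 1][u] + log_blank[t - 1][u] if t > 0 else NEG_INF
            # ... or by emitting a label from (t, u - 1).
            from_label = alpha[t][u - 1] + log_label[t][u - 1] if u > 0 else NEG_INF
            alpha[t][u] = logaddexp(from_blank, from_label)
    # A final blank at the last lattice node terminates the sequence.
    return alpha[T - 1][U] + log_blank[T - 1][U]


def full_sum_distillation_loss(student_log_prob, teacher_log_prob):
    """Assumed distance: squared difference of sequence log-posteriors."""
    return (student_log_prob - teacher_log_prob) ** 2


# Toy example: the teacher sees 2 frames, the student sees 1 frame.
half = math.log(0.5)
teacher_lp = rnnt_full_sum_log_prob([[half, half], [half, half]], [[half], [half]])
student_lp = rnnt_full_sum_log_prob([[half, half]], [[half]])
loss = full_sum_distillation_loss(student_lp, teacher_lp)
```

Because each model contributes only one scalar per utterance, the toy teacher and student above can use different numbers of frames, and no frame-level alignment between them is needed.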
