Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR
Abstract
Knowledge distillation is an effective machine learning technique for transferring knowledge from a teacher model to a student model. It is also a crucial component of methods that learn from unlabeled data, such as Noisy Student Training.
In this paper, we focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) ASR. Specifically, we compared soft and hard distillation targets for training large-scale RNN-T models on the LibriSpeech public dataset (60k hours) and our in-house data (600k hours).
We found that hard targets are more effective when distilling from a larger teacher model to a smaller streaming student model. Soft target distillation, on the other hand, works better when the teacher and student models have a similar network architecture.
For a large model with 600M parameters, we can achieve a new SoTA word error rate (WER) on LibriSpeech (8% relative improvement on dev-other) using Noisy Student Training with soft targets.
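To make the soft/hard distinction concrete, the following is a minimal sketch for a single output distribution, not the RNN-T training objective itself. It assumes that a soft target is the teacher's full probability distribution, matched with a KL divergence, and that a hard target is the teacher's single best label, matched with cross-entropy. In an RNN-T, the corresponding quantities would be computed over the model's output lattice or over decoded transcripts; the function names and logit values here are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def soft_target_loss(teacher_logits, student_logits):
    """KL(teacher || student): the student matches the full teacher distribution."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def hard_target_loss(teacher_logits, student_logits):
    """Cross-entropy against the teacher's top label, used as a pseudo-label."""
    label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    return -log_softmax(student_logits)[label]


# Hypothetical logits over a three-token vocabulary.
teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
print("soft target loss:", soft_target_loss(teacher, student))
print("hard target loss:", hard_target_loss(teacher, student))
```

The soft loss is zero only when the student reproduces the teacher's whole distribution, whereas the hard loss depends only on the probability the student assigns to the teacher's top choice.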