Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR

Dongseong Hwang; Khe Chai Sim; Trevor Strohman; Yu Zhang

SLT 2022 (2023)

Abstract

Knowledge distillation is an effective machine learning technique for transferring knowledge from a teacher model to a student model. It is also a crucial component of learning from unlabeled data, for example, in Noisy Student Training.
In this paper, we focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) ASR. Specifically, we compared using soft and hard distillation targets to train large-scale RNN-T models on the LibriSpeech public dataset (60k hours) and our in-house data (600k hours).
We found that hard targets are more effective when distilling from a larger teacher model to a smaller streaming student model. On the other hand, soft target distillation works better when the teacher and student models have a similar network architecture.
For a large model with 600M parameters, we can achieve a new SoTA word error rate (WER) on LibriSpeech (8% relative improvement on dev-other) using Noisy Student Training with soft targets.
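
To make the distinction between the two target types concrete, the following is a minimal sketch for a single output distribution. It assumes a simplified setting that is not taken from the paper: soft-target distillation is illustrated as the KL divergence between the teacher and student token distributions, and hard-target distillation as cross-entropy against the teacher's most likely token. In actual RNN-T training the soft loss would be accumulated over the full time-by-label lattice, and hard targets would typically be the teacher's decoded transcripts fed to the standard RNN-T loss. The function names `soft_target_loss` and `hard_target_loss` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def soft_target_loss(teacher_logits, student_logits):
    """Illustrative soft-target loss: KL(teacher || student).

    The student is trained to match the teacher's full distribution.
    """
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def hard_target_loss(teacher_logits, student_logits):
    """Illustrative hard-target loss: cross-entropy against a pseudo-label.

    The pseudo-label is the teacher's most likely token; the rest of the
    teacher's distribution is discarded.
    """
    label = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    return -log_softmax(student_logits)[label]


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0, 0.0]  # made-up logits for illustration
    student = [1.0, 1.0, -0.5, 0.0]
    print("soft:", soft_target_loss(teacher, student))
    print("hard:", hard_target_loss(teacher, student))
```

In this sketch the soft loss is zero only when the student reproduces the teacher's whole distribution, whereas the hard loss depends only on the probability the student assigns to the teacher's top token.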

Research Areas

Machine intelligence
