- Samuel Stanton
- Pavel Izmailov
- Polina Kirichenko
- Alex Alemi
- Andrew Gordon Wilson
Abstract
Knowledge distillation is a popular technique for training a small student network to match a larger teacher model, such as an ensemble of networks. In this paper, we show that while knowledge distillation has a useful regularizing effect, it does not typically work as it is commonly understood: there often remains a surprisingly large discrepancy between the predictive distributions of the teacher and the student, even in cases when the student has the capacity to perfectly match the teacher. We show that the dataset used for distillation and the amount of temperature scaling applied to the logits play a crucial role in how closely the student matches the teacher, and discuss optimal ways of setting these hyper-parameters in practice.
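For reference, the distillation objective the abstract refers to is commonly implemented as the KL divergence between temperature-scaled teacher and student predictive distributions. Below is a minimal PyTorch sketch of that standard loss, not the paper's exact training setup; the temperature value, tensor shapes, and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between temperature-scaled teacher and student
    predictive distributions (standard knowledge-distillation loss).
    The temperature**2 factor keeps gradient magnitudes comparable
    across different temperature settings."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    return (
        F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
        * temperature ** 2
    )

# Illustrative usage: random logits stand in for real model outputs.
# For an ensemble teacher, teacher_logits could be the (averaged) ensemble output.
student_logits = torch.randn(32, 10)
teacher_logits = torch.randn(32, 10)
loss = distillation_loss(student_logits, teacher_logits, temperature=4.0)
```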