Neighbourhood Distillation: On the benefits of non end-to-end distillation
Abstract
Knowledge Distillation is a popular method to reduce model size by transferring the knowledge of a large teacher model to a smaller student network. We show that it is possible to independently replace sub-parts of a network without accuracy loss. Based on this, we propose a distillation method that breaks the end-to-end paradigm by splitting the teacher architecture into smaller sub-networks, called neighbourhoods. For each neighbourhood, we distill a student independently and then merge the resulting students into a single student model. We show that this process is significantly faster than end-to-end Knowledge Distillation and produces students of the same quality.
From Neighbourhood Distillation, we design Student Search, an architecture search that leverages the independently distilled candidates to explore an exponentially large search space of architectures and locally selects the best candidate to use for the student model.
We apply Neighbourhood Distillation and Student Search to model-reduction and sparsification problems on CIFAR-10 and ImageNet models. Our method offers up to a $4.6\times$ speed-up over end-to-end distillation methods while retaining the same performance.
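To make the procedure concrete, the following is a minimal sketch of non end-to-end distillation on a toy model, written in PyTorch. The block structure, the MSE matching loss, and all names (`distill_neighbourhood`, `teacher_blocks`, and so on) are illustrative assumptions, not the paper's implementation: the teacher is split into sub-networks, each student block is regressed onto its neighbourhood's outputs independently, and the trained blocks are then merged into one student.

```python
# Minimal sketch of Neighbourhood Distillation (illustrative assumptions only).
import torch
import torch.nn as nn

def distill_neighbourhood(teacher_block: nn.Module,
                          student_block: nn.Module,
                          inputs: torch.Tensor,
                          steps: int = 100,
                          lr: float = 1e-3) -> nn.Module:
    """Train one student sub-network to mimic one teacher neighbourhood."""
    teacher_block.eval()
    with torch.no_grad():
        target = teacher_block(inputs)          # teacher's local output, used as a fixed regression target
    opt = torch.optim.Adam(student_block.parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.mse_loss(student_block(inputs), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student_block

# Toy teacher made of three neighbourhoods; each is replaced independently.
teacher_blocks = [nn.Sequential(nn.Linear(16, 16), nn.ReLU()) for _ in range(3)]
student_blocks = [nn.Sequential(nn.Linear(16, 16), nn.ReLU()) for _ in range(3)]

x = torch.randn(256, 16)                        # inputs fed to the first neighbourhood
for t_blk, s_blk in zip(teacher_blocks, student_blocks):
    distill_neighbourhood(t_blk, s_blk, x)      # each block is distilled on its own
    with torch.no_grad():
        x = t_blk(x)                            # teacher activations feed the next neighbourhood

student = nn.Sequential(*student_blocks)        # merge the distilled blocks into a single student
```

Because each block is trained against locally cached teacher activations, the distillation of different neighbourhoods is independent and can run in parallel, which is the source of the speed-up over end-to-end training.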