Layerwise Bregman Representation Learning of Neural Networks with Applications to Knowledge Distillation

Ehsan Amid
Rohan Anil
Christopher Fifty
Transactions on Machine Learning Research, 02/23(2023)


We propose a new method for layerwise representation learning of a trained neural network that conforms to the non-linearity of the layer’s transfer function. In particular, we form a Bregman divergence based on the convex function induced by the layer’s transfer function and extend the original Bregman PCA formulation by incorporating a mean vector and revising the normalization constraint on the principal directions. These modifications allow the learned representation to be exported as a fixed layer with a non-linearity. As an application to knowledge distillation, we cast the learning problem for the student network as predicting the compression coefficients of the teacher’s representations, which are then passed as input to the imported layer. Our empirical findings indicate that this approach transfers information between networks substantially more effectively than typical teacher-student training that uses the teacher’s soft labels.
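
To make the central object concrete, the following is a minimal sketch of a Bregman divergence induced by a transfer function. It is not the authors' implementation. It assumes, purely for illustration, an elementwise logistic (sigmoid) transfer function f, whose convex potential is F(x) = sum_i log(1 + exp(x_i)), so that f is the gradient of F. The function names and the argument order D_F(a, b) = F(a) - F(b) - f(b) . (a - b) are choices made for this sketch and may differ from the paper's notation.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)); the assumed convex potential F, per coordinate.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    # Derivative of softplus; stands in for the layer's transfer function f (an assumption).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bregman_divergence(a, b):
    """D_F(a, b) = F(a) - F(b) - f(b) . (a - b) for F = sum of softplus, f = sigmoid."""
    F_a = sum(softplus(v) for v in a)
    F_b = sum(softplus(v) for v in b)
    inner = sum(sigmoid(bv) * (av - bv) for av, bv in zip(a, b))
    return F_a - F_b - inner
```

Because F is convex, the divergence is non-negative and vanishes when the two arguments coincide. It is generally not symmetric in its arguments, which is why the argument order has to be fixed by convention.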
