Manfred K. Warmuth
Authored Publications
Layerwise Bregman Representation Learning of Neural Networks with Applications to Knowledge Distillation
Ehsan Amid
Rohan Anil
Christopher Fifty
Transactions on Machine Learning Research, 02/23 (2023)
Abstract
We propose a new method for layerwise representation learning of a trained neural network that conforms to the non-linearity of the layer’s transfer function. In particular, we form a Bregman divergence based on the convex function induced by the layer’s transfer function and construct an extension of the original Bregman PCA formulation by incorporating a mean vector and revising the normalization constraint on the principal directions. These modifications allow exporting the learned representation as a fixed layer with a non-linearity. As an application to knowledge distillation, we cast the learning problem for the student network as predicting the compression coefficients of the teacher’s representations, which are then passed as the input to the imported layer. Our empirical findings indicate that our approach is substantially more effective for transferring information between networks than typical teacher-student training that uses the teacher’s soft labels.
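The core building block here is the Bregman divergence induced by a layer's transfer function: if the transfer function f is the gradient of a convex function F, then D_F(x, y) = F(x) - F(y) - f(y)·(x - y). The NumPy sketch below illustrates only this induced divergence, using the sigmoid/softplus pair as one concrete example; it is not the paper's extended Bregman PCA formulation, and the function names are ours.

```python
import numpy as np

def softplus_sum(x):
    """Convex function F(x) = sum_i log(1 + exp(x_i)), whose gradient is the elementwise sigmoid."""
    return np.sum(np.logaddexp(0.0, x))

def sigmoid(x):
    """Gradient of softplus_sum, i.e. the layer's transfer function in this example."""
    return 1.0 / (1.0 + np.exp(-x))

def bregman_divergence(x, y, F=softplus_sum, grad_F=sigmoid):
    """Bregman divergence D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return F(x) - F(y) - np.dot(grad_F(y), x - y)
```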
Rank-smoothed Pairwise Learning in Perceptual Quality Assessment
Ehsan Amid
2020 IEEE International Conference on Image Processing (ICIP 2020)
Abstract
Conducting pairwise comparisons is a widely used approach in curating human perceptual preference data. Typically raters are instructed to make their choices according to a specific set of rules that address certain dimensions of image quality and aesthetics. The outcome of this process is a dataset of sampled image pairs with their associated empirical preference probabilities. Training a model on these pairwise preferences is a common deep learning approach. However, optimizing by gradient descent through mini-batch learning means that the “global” ranking of the images is not explicitly taken into account. In other words, each step of the gradient descent relies only on a limited number of pairwise comparisons. In this work, we demonstrate that regularizing the pairwise empirical probabilities with aggregated rankwise probabilities leads to a more reliable training loss. We show that training a deep image quality assessment model with our rank-smoothed loss consistently improves the accuracy of predicting human preferences.
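As a rough illustration of the kind of smoothing described above, the sketch below aggregates a global ranking from pairwise win counts with a Bradley-Terry fit and blends the implied rankwise probabilities into the empirical pairwise targets. The Bradley-Terry aggregation, the linear blend, and the weight `alpha` are illustrative choices, not necessarily the paper's exact rank-smoothed loss.

```python
import numpy as np

def bradley_terry_scores(wins, n_iters=100):
    """Estimate global quality scores from a pairwise win-count matrix
    wins[i, j] = number of times item i was preferred over item j.
    Assumes every item appears in at least one comparison and has at
    least one win (standard MM iteration for the Bradley-Terry model)."""
    n = wins.shape[0]
    scores = np.ones(n)
    totals = wins + wins.T  # total comparisons between each pair
    for _ in range(n_iters):
        denom = totals / (scores[:, None] + scores[None, :])
        np.fill_diagonal(denom, 0.0)
        scores = wins.sum(axis=1) / denom.sum(axis=1)
        scores = scores / scores.sum()
    return scores

def rank_smoothed_targets(emp_probs, pairs, scores, alpha=0.5):
    """Blend empirical pairwise preference probabilities with the rankwise
    probabilities implied by the aggregated scores; emp_probs[k] is the
    empirical probability that pairs[k, 0] beats pairs[k, 1]."""
    i, j = pairs[:, 0], pairs[:, 1]
    rank_probs = scores[i] / (scores[i] + scores[j])
    return (1.0 - alpha) * emp_probs + alpha * rank_probs
```

A deep quality model would then be trained with cross-entropy against these blended targets instead of the raw empirical probabilities.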
Abstract
In experimental design, we are given a large collection of vectors, each with a hidden response value that we assume derives from an underlying linear model, and we wish to pick a small subset of the vectors such that querying the corresponding responses will lead to a good estimator of the model. A classical approach in statistics is to assume the responses are linear, plus zero-mean i.i.d. Gaussian noise, in which case the goal is to provide an unbiased estimator with smallest mean squared error (A-optimal design). A related approach, more common in computer science, is to assume the responses are arbitrary but fixed, in which case the goal is to estimate the least squares solution using few responses, as quickly as possible, for worst-case inputs. Despite many attempts, characterizing the relationship between these two approaches has proven elusive. We address this by proposing a framework for experimental design where the responses are produced by an arbitrary unknown distribution. We show that there is an efficient randomized experimental design procedure that achieves strong variance bounds for an unbiased estimator using few responses in this general model. Nearly tight bounds for the classical A-optimality criterion, as well as improved bounds for worst-case responses, emerge as special cases of this result. In the process, we develop a new algorithm for a joint sampling distribution called volume sampling, and we propose a new i.i.d. importance sampling method: inverse score sampling. A key novelty of our analysis is in developing new expected error bounds for worst-case regression by controlling the tail behavior of i.i.d. sampling via the jointness of volume sampling. Our result motivates a new minimax-optimality criterion for experimental design which can be viewed as an extension of both A-optimal design and sampling for worst-case regression.
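For orientation, the sketch below shows a standard i.i.d. subsampling scheme for least squares: rows are drawn with probability proportional to their leverage scores, only those responses are queried, and the subproblem is importance-weighted so the normal equations are unbiased in expectation. This is a generic baseline for intuition only; the paper's volume sampling and inverse score sampling are different schemes with stronger guarantees, and `query_response` is a hypothetical callback standing in for querying a hidden response.

```python
import numpy as np

def leverage_scores(X):
    """Leverage score of row i is the squared norm of row i of Q, where X = QR (thin QR)."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)

def subsampled_least_squares(X, query_response, k, seed=0):
    """Draw k rows i.i.d. proportionally to leverage scores, query only those
    responses, and solve the importance-weighted least squares subproblem."""
    rng = np.random.default_rng(seed)
    p = leverage_scores(X)
    p = p / p.sum()
    idx = rng.choice(X.shape[0], size=k, p=p)
    w = 1.0 / np.sqrt(k * p[idx])  # importance weights for unbiasedness
    Xs = w[:, None] * X[idx]
    ys = w * np.array([query_response(i) for i in idx])
    coef, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return coef
```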
Robust Bi-Tempered Logistic Loss Based on Bregman Divergences
Ehsan Amid
Rohan Anil
Thirty-Third Annual Conference on Neural Information Processing Systems (NeurIPS) (2019)
Abstract
We introduce a temperature into the exponential function and replace the softmax output layer of neural nets by a high-temperature generalization. Similarly, the logarithm in the log loss we use for training is replaced by a low-temperature logarithm. By tuning the two temperatures, we create loss functions that are non-convex already in the single-layer case. When the last layer of the network is replaced by our bi-temperature generalization of the logistic loss, training becomes more robust to noise. We visualize the effect of tuning the two temperatures in a simple setting and show the efficacy of our method on large datasets. Our methodology is based on Bregman divergences and is superior to a related two-temperature method that uses the Tsallis divergence.
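The tempered functions behind the loss have simple closed forms: log_t(x) = (x^(1-t) - 1)/(1-t) and exp_t(x) = [1 + (1-t)x]_+^(1/(1-t)), both reducing to log and exp as t -> 1. The NumPy sketch below computes the bi-tempered loss for a single example with a basic fixed-point normalization for the tempered softmax; it simplifies the numerical details relative to a production implementation.

```python
import numpy as np

def log_t(x, t):
    """Tempered logarithm; recovers log(x) as t -> 1."""
    if t == 1.0:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    """Tempered exponential; recovers exp(x) as t -> 1."""
    if t == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

def tempered_softmax(activations, t, n_iters=30):
    """Heavy-tailed softmax replacement for t > 1: find lambda with
    sum_i exp_t(a_i - lambda) = 1 by a simple fixed-point iteration."""
    mu = np.max(activations)
    a_tilde = activations - mu
    for _ in range(n_iters):
        z = np.sum(exp_t(a_tilde, t))
        a_tilde = z ** (1.0 - t) * (activations - mu)
    lam = -log_t(1.0 / np.sum(exp_t(a_tilde, t)), t) + mu
    return exp_t(activations - lam, t)

def bi_tempered_loss(activations, labels, t1, t2, eps=1e-10):
    """Bi-tempered logistic loss for a single example; labels is a one-hot
    (or soft) probability vector, with temperatures t1 < 1 and t2 > 1."""
    probs = tempered_softmax(activations, t2)
    return np.sum(
        labels * (log_t(labels + eps, t1) - log_t(probs + eps, t1))
        - (labels ** (2.0 - t1) - probs ** (2.0 - t1)) / (2.0 - t1)
    )
```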
Abstract
We consider online learning with linear models, where the algorithm predicts on sequentially revealed instances (feature vectors) and is compared against the best linear function (comparator) in hindsight. Popular algorithms in this framework, such as Online Gradient Descent (OGD), have parameters (learning rates) which ideally should be tuned based on the scales of the features and the optimal comparator, but these quantities only become available at the end of the learning process. In this paper, we resolve the tuning problem by proposing online algorithms that make predictions invariant under arbitrary rescaling of the features. The algorithms have no parameters to tune, do not require any prior knowledge of the scale of the instances or the comparator, and achieve regret bounds matching (up to a logarithmic factor) that of OGD with optimally tuned separate learning rates per dimension, while retaining comparable runtime performance.
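For context, the comparator referenced above, OGD with a separate learning rate per dimension, looks roughly like the sketch below. This is the baseline whose optimally tuned regret the proposed parameter-free algorithms match up to a logarithmic factor, not the paper's scale-invariant algorithm itself; `stream` and `loss_grad` are illustrative names.

```python
import numpy as np

def ogd_per_dimension(stream, dim, learning_rates, loss_grad):
    """Online Gradient Descent on a linear model with a separate learning
    rate per coordinate. `stream` yields (features, label) pairs and
    `loss_grad(y_hat, y)` is the derivative of the loss in the prediction."""
    w = np.zeros(dim)
    eta = np.asarray(learning_rates, dtype=float)
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        y_hat = float(np.dot(w, x))   # prediction uses only the features
        yield y_hat
        w = w - eta * loss_grad(y_hat, y) * x  # per-coordinate gradient step

def squared_loss_grad(y_hat, y):
    """Derivative of 0.5 * (y_hat - y)**2 with respect to y_hat."""
    return y_hat - y
```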