Max Vladymyrov
Authored Publications
We focus on the problem of learning without forgetting from multiple tasks arriving sequentially, where each task is defined using a few-shot episode of novel or already seen classes. We approach this problem using the recently published HyperTransformer (HT), a Transformer-based hypernetwork that generates specialized task-specific CNN weights directly from the support set. In order to learn from a continual sequence of tasks, we propose to recursively re-use the generated weights as input to the HT for the next task. This way, the generated CNN weights themselves act as a representation of previously learned tasks, and the HT is trained to update these weights so that the new task can be learned without forgetting past tasks. This approach is different from most continual learning algorithms, which typically rely on replay buffers, weight regularization or task-dependent architectural changes. We demonstrate that our proposed Continual HyperTransformer method, equipped with a prototypical loss, is capable of learning and retaining knowledge about past tasks in a variety of scenarios, including learning from mini-batches as well as task-incremental and class-incremental learning.
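As a rough illustration of the recursion described above, the sketch below feeds the weights generated for one task back into the HyperTransformer together with the next task's support set. The function ht_fn, the episode format and init_weights are illustrative placeholders, not the paper's actual interfaces.

```python
def continual_hypertransformer(episodes, ht_fn, init_weights):
    """Schematic recursion: the weights generated for task t are re-used as
    input when generating weights for task t + 1, so the CNN weights act as a
    running representation of all previously seen tasks."""
    weights = init_weights
    for support_set in episodes:       # few-shot tasks arriving sequentially
        weights = ht_fn(weights, support_set)
    return weights                     # CNN weights covering every task so far
```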
Transformers learn in-context by gradient descent
João Sacramento
International Conference on Machine Learning (2023), pp. 35151-35174
Transformers have become the state-of-the-art neural network architecture across numerous domains of machine learning. This is partly due to their celebrated ability to transfer and to learn in-context based on a few examples. Nevertheless, the mechanism of why and how Transformers become in-context learners is not well understood and remains mostly an intuition. Here, we argue that training Transformers on auto-regressive tasks can be closely related to well-known gradient-based meta-learning formulations. We do so by providing a simple construction that shows the equivalence of the data transformations induced by 1) a single linear self-attention layer and by 2) gradient descent on a regression loss. Motivated by that construction, we show empirically that when training self-attention-only Transformers on simple regression tasks, either the models learned by GD and the Transformers show great similarity or, remarkably, the solutions found by gradient descent converge in weight space to our construction. This allows us, at least on our simple regression tasks, to mechanistically understand the inner workings of Transformers that enable in-context learning. Finally, we discuss intriguing parallels to a mechanism identified as crucial for in-context learning, termed induction heads (Olsson et al., 2022), and show how it could be generalized by in-context learning via gradient descent within Transformers.
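The equivalence can be checked numerically in a toy setting. The snippet below is my own minimal example, not code from the paper: it assumes scalar linear regression, a zero weight initialization W0 = 0, and a linear (softmax-free) attention operation whose key/query projections read out the inputs and whose value projection reads out the scaled targets. Under these assumptions the attention update written into the query token matches the prediction after one gradient-descent step.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, eta = 5, 20, 0.1

# In-context linear regression data: y_i = W* x_i, plus one query input x_q.
W_star = rng.normal(size=(1, d))
X = rng.normal(size=(N, d))            # support inputs
y = X @ W_star.T                       # support targets, shape (N, 1)
x_q = rng.normal(size=(d,))

# One gradient-descent step on L(W) = 1/(2N) * sum_i ||W x_i - y_i||^2,
# starting from W0 = 0.
W0 = np.zeros((1, d))
grad = (W0 @ X.T - y.T) @ X / N        # dL/dW, shape (1, d)
W1 = W0 - eta * grad
pred_gd = (W1 @ x_q).item()            # prediction for x_q after one GD step

# A single linear (softmax-free) self-attention layer over tokens e_i = (x_i, y_i):
# with keys/queries reading out x and values reading out (eta / N) * y, the update
# to the query token's target slot is sum_i (x_q . x_i) * (eta / N) * y_i.
pred_attn = ((eta / N) * (x_q @ X.T) @ y).item()

print(pred_gd, pred_attn)              # identical up to floating-point error
assert np.isclose(pred_gd, pred_attn)
```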
GradMax: Growing Neural Networks using Gradient Information
Bart van Merriënboer
Thomas Unterthiner
The International Conference on Learning Representations (2022)
The architecture and the parameters of neural networks are often optimized independently, which requires costly retraining of the parameters whenever the architecture is modified. In this work we instead focus on growing the architecture without requiring costly retraining. We present a method that adds new neurons during training without impacting what is already learned, while improving the training dynamics. We achieve the latter by maximizing the gradients of the new weights and find the optimal initialization efficiently by means of the singular value decomposition (SVD). We call this technique Gradient Maximizing Growth (GradMax) and demonstrate its effectiveness in a variety of vision tasks and architectures.
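A schematic version of this idea for a fully connected layer is sketched below. It is my own simplified reading of the abstract: the new neurons' outgoing weights start at zero (so the grown network computes the same function as before), and the fan-in weights are taken from the top singular vectors of the cross-moment between the backward signal at the next layer and the activations feeding the grown layer. The paper's exact formulation may differ.

```python
import numpy as np

def gradmax_style_init(h_prev, delta_next, k, scale=1e-3):
    """Schematic GradMax-style initialization of k new hidden neurons.

    h_prev:     (batch, d_prev)  activations feeding the grown layer
    delta_next: (batch, d_next)  loss gradients w.r.t. the next layer's pre-activations
    Returns fan-in and fan-out weights for the new neurons.
    """
    # Cross-moment between the backward signal and the forward activations.
    M = delta_next.T @ h_prev / h_prev.shape[0]       # (d_next, d_prev)
    # Top-k right singular vectors maximize the gradient norm of the
    # zero-initialized outgoing weights under a norm constraint
    # (requires k <= min(d_prev, d_next)).
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    W_in_new = scale * Vt[:k]                         # (k, d_prev) fan-in weights
    # Zero outgoing weights leave the network function unchanged; training
    # then gradually recruits the new units.
    W_out_new = np.zeros((delta_next.shape[1], k))    # (d_next, k)
    return W_in_new, W_out_new
```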
In this paper we propose augmenting Vision Transformer models with learnable memory tokens. Our approach allows the model to adapt to new tasks, using few parameters, while optionally preserving its capabilities on previously learned tasks. At each layer we introduce a set of learnable embedding vectors that provide contextual information useful for specific datasets. We call these "memory tokens". We show that augmenting a model with just a handful of such tokens per layer significantly improves accuracy compared to conventional head-only fine-tuning, and performs only slightly below the significantly more expensive full fine-tuning. We then propose an attention-masking approach that enables extension to new downstream tasks with computation reuse. In this setup, in addition to being parameter-efficient, models can execute both old and new tasks as part of a single inference at a small incremental cost.
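The core mechanism, as I read the abstract, can be sketched as appending a few learnable per-layer tokens to the keys and values of each self-attention operation, so the output sequence length is unchanged. The function below is a minimal single-head illustration with hypothetical projection matrices Wq, Wk, Wv; it is not the paper's ViT code and omits the attention-masking scheme.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_memory(x, memory, Wq, Wk, Wv):
    """Self-attention where m learnable 'memory' tokens are appended to the
    keys/values only, so the output length matches the input length.

    x:      (n, d)  input tokens for one image
    memory: (m, d)  per-layer learnable memory tokens trained for a new task
    """
    kv_in = np.concatenate([x, memory], axis=0)    # (n + m, d)
    q = x @ Wq                                     # queries come from inputs only
    k = kv_in @ Wk
    v = kv_in @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (n, n + m)
    return softmax(scores) @ v                     # (n, d_v)
```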
In this work we propose a HyperTransformer, a Transformer-based model for supervised and semi-supervised few-shot learning that generates weights of a convolutional neural network (CNN) directly from support samples. Since the dependence of a small generated CNN model on a specific task is encoded by a high-capacity Transformer model, we effectively decouple the complexity of the large task space from the complexity of individual tasks. Our method is particularly effective for small target CNN architectures where learning a fixed universal task-independent embedding is not optimal and better performance is attained when the information about the task can modulate all model parameters. For larger models we discover that generating the last layer alone allows us to produce competitive or better results than those obtained with state-of-the-art methods while being end-to-end differentiable.
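For the last-layer-only variant mentioned at the end of the abstract, a rough sketch of the data flow might look like the following. The function transformer_fn, the label embeddings and the placeholder "weight tokens" are illustrative assumptions rather than the paper's actual architecture.

```python
import numpy as np

def generate_last_layer(support_feats, support_labels, transformer_fn, n_classes):
    """Schematic sketch: a Transformer reads the support set and emits the
    weights of the final classification layer of a small CNN.

    support_feats:  (k, d)  features of the support images
    support_labels: (k,)    integer labels in [0, n_classes)
    transformer_fn: stand-in for the HyperTransformer; maps a token sequence
                    of shape (k + n_classes, d) to one of the same shape
    """
    k, d = support_feats.shape
    rng = np.random.default_rng(0)
    label_emb = rng.normal(size=(n_classes, d))        # learned in practice
    support_tokens = support_feats + label_emb[support_labels]
    weight_queries = np.zeros((n_classes, d))          # placeholder weight tokens
    tokens = np.concatenate([support_tokens, weight_queries], axis=0)
    out = transformer_fn(tokens)
    return out[k:]                                     # (n_classes, d) generated head

# Query-time classification with the generated head:
#   logits = query_feats @ W_head.T
```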
Meta-Learning Bidirectional Update Rules
Andrew Jackson
Tom Madams
ICML 2021
In this paper, we introduce a new type of generalized neural network where neurons and synapses maintain multiple states. We show that classical gradient-based backpropagation in neural networks can be seen as a special case of a two-state network where one state is used for activations and another for gradients, with update rules derived from the chain rule. In our generalized framework, networks have no explicit notion of gradients and never receive them. Instead, synapses and neurons are updated using a bidirectional Hebb-style update rule parameterized by a shared low-dimensional "genome". We show that such genomes can be meta-learned from scratch, using either conventional optimization techniques or evolutionary strategies such as CMA-ES. The resulting update rules generalize to unseen tasks and train faster than gradient-descent-based optimizers on several standard computer vision and synthetic tasks.
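To give a flavor of such an update, here is a loose sketch based only on the abstract, not the paper's actual parameterization: each neuron carries s states, and the synapse update is a genome-weighted mixture of state-pair outer products. In the special case that recovers backpropagation, one state would play the role of activations and another the role of error signals.

```python
import numpy as np

def hebbian_genome_update(pre_states, post_states, W, genome, lr=1e-2):
    """Loose sketch of a Hebb-style update driven by a small shared 'genome'.

    pre_states:  (s, n_pre)   s states per presynaptic neuron
    post_states: (s, n_post)  s states per postsynaptic neuron
    W:           (n_post, n_pre) synapse strengths
    genome:      (s, s) coefficients mixing state-pair outer products
    """
    # Sum over state pairs (a, b) of genome[a, b] * outer(post_a, pre_b).
    delta = np.einsum('ab,an,bm->nm', genome, post_states, pre_states)
    return W + lr * delta
```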
Underspecification Presents Challenges for Credibility in Modern Machine Learning
Dan Moldovan
Ben Adlam
Babak Alipanahi
Alex Beutel
Christina Chen
Jon Deaton
Matthew D. Hoffman
Shaobo Hou
Neil Houlsby
Ghassen Jerfel
Yian Ma
Diana Mincu
Akinori Mitani
Andrea Montanari
Christopher Nielsen
Thomas Osborne
Rajiv Raman
Kim Ramasamy
Martin Gamunu Seneviratne
Shannon Sequeira
Harini Suresh
Victor Veitch
Steve Yadlowsky
Journal of Machine Learning Research (2020)
ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
No Pressure! Addressing the Problem of Local Minima in Manifold Learning
33rd Annual Conference on Neural Information Processing Systems (2019)
Nonlinear embedding manifold learning methods provide invaluable visual insight into the structure of high-dimensional data. However, due to a complicated nonlinear objective function, these methods can easily get stuck in local minima, which degrades the embedding quality. We propose a natural extension to several manifold learning methods aimed at identifying pressured points, i.e. points stuck in poor local minima and embedded poorly. We show that the pressure can be decreased by temporarily allowing these points to make use of an extra dimension in the embedding space. In our evaluation we show that the method is able to improve the objective function value of existing methods even after they get stuck in a poor local minimum.
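A minimal sketch of the extra-dimension trick, as I understand it from the abstract: each embedding point temporarily receives one additional coordinate, the embedding is re-optimized in d + 1 dimensions, points that move noticeably into the new coordinate are flagged as pressured, and the coordinate is then annealed back to zero. The helper below shows only the augmentation step and is purely illustrative.

```python
import numpy as np

def add_pressure_dimension(Y, eps=1e-4, rng=None):
    """Augment an (N, d) embedding with a tiny extra coordinate. After
    re-optimizing in d + 1 dimensions, points with a large value in the new
    coordinate are the 'pressured' ones stuck in a poor local minimum."""
    rng = rng or np.random.default_rng(0)
    z = eps * rng.standard_normal((Y.shape[0], 1))
    return np.concatenate([Y, z], axis=1)
```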