Andrey Zhmoginov
Research Areas
Authored Publications
Sort By
Preview abstract
We focus on the problem of learning without forgetting from multiple tasks arriving sequentially, where each task is defined using a few-shot episode of novel or already seen classes. We approach this problem using the recently published HyperTransformer (HT), a Transformer-based hypernetwork that generates specialized task-specific CNN weights directly from the support set. In order to learn from a continual sequence of tasks, we propose to recursively re-use the generated weights as input to the HT for the next task. This way, the generated CNN weights themselves act as a representation of previously learned tasks, and the HT is trained to update these weights so that the new task can be learned without forgetting past tasks. This approach is different from most continual learning algorithms that typically rely on using replay buffers, weight regularization or task-dependent architectural changes. We demonstrate that our proposed Continual HyperTransformer method equipped with a prototypical loss is capable of learning and retaining knowledge about past tasks for a variety of scenarios, including learning from mini-batches, and task-incremental and class-incremental learning scenarios.
View details
Transformers learn in-context by gradient descent
João Sacramento
International Conference on Machine Learning (2023), pp. 35151-35174
Preview abstract
Transformers have become the state-of-the-art neural network architecture across numerous
domains of machine learning. This is partly due to their celebrated ability to transfer and
to learn in-context based on a few examples. Nevertheless, the mechanism of why and
how Transformers become in-context learners is not well understood and remains mostly an
intuition. Here, we argue that training Transformers on auto-regressive tasks can be closely
related to well-known gradient-based meta-learning formulations. We do so by providing
a simple construction that shows the equivalence of data transformations induced by 1) a
single linear self-attention layer and by 2) gradient-descent on a regression loss. Motivated by
that construction, we show empirically that when training self-attention only Transformers
on simple regression tasks either the models learned by GD and Transformers show great
similarity or, remarkably, the solutions found by gradient descent converge in weight space to
our construction. This allows us, at least on our simple regression tasks, to mechanistically
understand the inner workings of Transformers that enables in-context learning within.
Finally, we discuss intriguing parallels to a mechanism identified as crucial for in-context
learning termed induction-head (Olsson et al., 2022) and show how it could be generalized
by in-context learning by gradient descent within Transformers.
View details
Preview abstract
In this work we propose a HyperTransformer, a Transformer-based model for supervised and semi-supervised few-shot learning that generates weights of a convolutional neural network (CNN) directly from support samples. Since the dependence of a small generated CNN model on a specific task is encoded by a high-capacity Transformer model, we effectively decouple the complexity of the large task space from the complexity of individual tasks. Our method is particularly effective for small target CNN architectures where learning a fixed universal task-independent embedding is not optimal and better performance is attained when the information about the task can modulate all model parameters. For larger models we discover that generating the last layer alone allows us to produce competitive or better results than those obtained with state-of-the-art methods while being end-to-end differentiable.
View details
Preview abstract
In this paper we propose augmenting Vision Transformer models with learnable memory tokens. Our approach allows the model to adapt to new tasks, using few parameters, while optionally preserving its capabilities on previously learned tasks. At each layer we introduce a set of learnable embedding vectors that provide contextual information useful for specific datasets. We call these "memory tokens". We show that augmenting a model with just a handful of such tokens per layer significantly improves accuracy when compared to conventional head-only fine-tuning, and performs only slightly below the significantly more expensive full fine-tuning. We then propose an attention-masking approach that enables extension to new downstream tasks, with a computation reuse. In this setup in addition to being parameters efficient, models can execute both old and new tasks as a part of single inference at a small incremental cost.
View details
Preview abstract
Conditional computation and modular networks have been recently proposed for multitask learning and other problems as a way to decompose problem solving into multiple reusable computational blocks. We propose a new approach for learning modular networks based on the isometric version of ResNet with all residual blocks having the same configuration and the same number of parameters. This architectural choice allows adding, removing and changing the order of residual blocks. In our method, the modules can be invoked repeatedly and allow knowledge transfer to novel tasks by adjusting the order of computation. This allows soft weight sharing between tasks with only a small increase in the number of parameters. We show that our method leads to interpretable self-organization of modules in case of multi-task learning, transfer learning and domain adaptation while achieving competitive results on those tasks. From practical perspective, our approach allows to: (a) reuse existing modules for learning new task by adjusting the computation order, (b) use it for unsupervised multi-source domain adaptation to illustrate that adaptation to unseen data can be achieved by only manipulating the order of pretrained modules, (c) show how our approach can be used to increase accuracy of existing architectures for image classification tasks such as ImageNet, without any parameter increase, by reusing the same block multiple times.
View details
Meta-Learning Bidirectional Update Rules
Andrew Jackson
Tom Madams
ICML 2021
Preview abstract
In this paper, we introduce a new type of generalized neural network where neurons and synapses maintain multiple states. We show that classical gradient-based backpropagation in neural networks can be seen as a special case of a two-state network where one state is used for activations and another for gradients, with update rules derived from the chain rule. In our generalized framework, networks have neither explicit notion of nor ever receive gradients. The synapses and neurons are updated using a bidirectional Hebb-style update rule parameterized by a shared low-dimensional "genome". We show that such genomes can be meta-learned from scratch, using either conventional optimization techniques, or evolutionary strategies, such as CMA-ES. Resulting update rules generalize to unseen tasks and train faster than gradient descent based optimizers for several standard computer vision and synthetic tasks.
View details
BasisNet: Two-Stage Model Synthesis for Efficient Inference
Chun-Te Chu
Andrew Howard
Yukun Zhu
Rebecca Hwa
Adriana Kovashka
CVPR Workshop on Efficient Deep Learning for Computer Vision (ECV) (2021)
Preview abstract
In this work, we present BasisNet which combines recent advancements in efficient neural network architectures, conditional computation, and early termination in a simple new form. Our approach incorporates a lightweight model to preview the input and generate input-dependent combination coefficients, which later controls the synthesis of a more accurate specialist model to make final prediction. The two-stage model synthesis strategy can be applied to any network architectures and both stages are jointly trained. We also show that proper training recipes are critical for increasing generalizability for such high capacity neural networks. On ImageNet classification benchmark, our BasisNet with MobileNets as backbone demonstrated clear advantage on accuracy-efficiency trade-off over several strong baselines. Specifically, BasisNet-MobileNetV3 obtained 80.3% top-1 accuracy with only 290M Multiply-Add operations, halving the computational cost of previous state-of-the-art without sacrificing accuracy. With early termination, the average cost can be further reduced to 198M MAdds while maintaining accuracy of 80.0% on ImageNet.
View details
K for the price of 1. Parameter efficient multi-task and transfer learning
Andrew Howard
International Conference on Learning Representations (2019)
Preview abstract
In this paper we introduce a novel method that enables parameter efficient transfer and multitask learning.
We show that by reusing more than 95\% of the parameters we can re-purpose neural networks to solve very
different types of problems such as going from COCO-dataset SSD detection to Imagenet classification.
Our approach allows both simultaneous (e.g. multi-task) learning as well as sequential fine-tuning where
we change the already trained networks to solve a different problem.
We show that our approach leads to significant increase in accuracy when compared to traditional logits-only fine-tuning
while using much fewer parameters. Interestingly, for multi-task learning our approach sometimes acts as a regularizer often leading
to improved performance when compared to models trained on a single task.
Our approach has multiple immediate applications. It can be used to dramatically increase the number of models available in resource-constrained settings, since the marginal cost of a new model is now less than 5\% of the full model. The constrained fine-tuning enables better generalization when limited amount data is available. We evaluate our approach on multiple datasets and multiple models.
View details
Preview abstract
We explore the question of how the resolution of input image affects the performance of a neural network when compared to the resolution of hidden layers. Image resolution is frequently used as a hyper parameter providing a trade-off between model performance and accuracy. An intuitive interpretation is that the decay in accuracy when reducing input resolution, is caused by the reduced information content in the low-resolution input. Left unsaid often the fact that this also reduces the model's internal resolution. In this paper, we show
that up-to a point the resolution plays very little role in the network performance. We show that another obvious hypothesis, such as changes in receptive fields, is not the primary root causes either. We then use this insight, to develop novel neural network architectures that we call {\it isometric neural networks} that maintain fixed internal resolution throughout their entire depth and demonstrate that it lead of high accuracy models with low activation footprint and a parameter count.
\end{abstract}
View details
Preview abstract
A new method for learning image attention masks in a semi-supervised setting based on the Information Bottleneck principle is proposed. Provided with a set of labeled images, the mask generation model is minimizing mutual information between the input and the masked image while maximizing the mutual information between the same masked image and the image label. In contrast with other approaches, our attention model produces a boolean rather than a continuous mask thus entirely concealing information from masked-out pixels. Using a set of synthetic datasets based on MNIST and CIFAR10 and a SVHN dataset, we demonstrate that our method can successfully attend to features defining the image class. We also discuss potential drawbacks of our methods and propose a mask randomization technique to alleviate one of them.
View details