Andrey Zhmoginov

Authored Publications
    Abstract: We focus on the problem of learning without forgetting from multiple tasks arriving sequentially, where each task is defined using a few-shot episode of novel or already seen classes. We approach this problem using the recently published HyperTransformer (HT), a Transformer-based hypernetwork that generates specialized task-specific CNN weights directly from the support set. In order to learn from a continual sequence of tasks, we propose to recursively re-use the generated weights as input to the HT for the next task. In this way, the generated CNN weights themselves act as a representation of previously learned tasks, and the HT is trained to update these weights so that the new task can be learned without forgetting past ones. This approach differs from most continual learning algorithms, which typically rely on replay buffers, weight regularization or task-dependent architectural changes. We demonstrate that our proposed Continual HyperTransformer method, equipped with a prototypical loss, is capable of learning and retaining knowledge about past tasks in a variety of scenarios, including learning from mini-batches as well as task-incremental and class-incremental learning.
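    A minimal sketch (not the authors' code) of the recursive weight-update loop described above; HyperNet is a hypothetical stand-in for the HyperTransformer that maps a support episode plus the previously generated CNN weights to updated weights.

        import torch
        import torch.nn as nn

        class HyperNet(nn.Module):
            """Toy hypernetwork: pools support features and emits a flat weight vector."""
            def __init__(self, feat_dim: int, n_weights: int):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(feat_dim + n_weights, 256), nn.ReLU(),
                    nn.Linear(256, n_weights),
                )

            def forward(self, support_feats, prev_weights):
                pooled = support_feats.mean(dim=0)                  # aggregate the episode
                return self.net(torch.cat([pooled, prev_weights]))  # updated task weights

        hypernet = HyperNet(feat_dim=64, n_weights=128)
        weights = torch.zeros(128)                      # initial "empty" task representation
        for task in range(5):                           # continual sequence of few-shot episodes
            support_feats = torch.randn(25, 64)         # stand-in for a 5-way 5-shot episode
            weights = hypernet(support_feats, weights)  # recursively re-use generated weights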
    Abstract: Transformers have become the state-of-the-art neural network architecture across numerous domains of machine learning. This is partly due to their celebrated ability to transfer and to learn in-context from a few examples. Nevertheless, why and how Transformers become in-context learners is not well understood and remains largely a matter of intuition. Here, we argue that training Transformers on auto-regressive tasks can be closely related to well-known gradient-based meta-learning formulations. We do so by providing a simple construction that shows the equivalence of the data transformations induced by 1) a single linear self-attention layer and 2) gradient descent on a regression loss. Motivated by that construction, we show empirically that when training self-attention-only Transformers on simple regression tasks, either the models learned by GD and the trained Transformers show great similarity or, remarkably, the solutions found by gradient descent converge in weight space to our construction. This allows us, at least on our simple regression tasks, to mechanistically understand the inner workings of Transformers that enable in-context learning. Finally, we discuss intriguing parallels to a mechanism identified as crucial for in-context learning, termed induction heads (Olsson et al., 2022), and show how it could be generalized to in-context learning by gradient descent within Transformers.
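    A small numerical sketch of the stated equivalence, using my own simplified parameterization rather than the paper's exact construction: one gradient-descent step on in-context linear regression yields the same query prediction as a linear-attention-style sum over the context tokens.

        import torch

        torch.manual_seed(0)
        d, n, lr = 4, 16, 0.1
        X = torch.randn(n, d)          # in-context inputs x_i
        y = X @ torch.randn(d)         # in-context targets y_i
        x_q = torch.randn(d)           # query token
        W = torch.zeros(d)             # initial regression weights

        # (1) explicit gradient-descent step on L(W) = 0.5 * sum_i (W.x_i - y_i)^2
        grad = X.T @ (X @ W - y)
        pred_gd = (W - lr * grad) @ x_q

        # (2) the same prediction as a linear-attention-style update: each context token
        # contributes its residual (y_i - W.x_i) weighted by the score x_i.x_q
        pred_attn = W @ x_q + lr * ((y - X @ W) * (X @ x_q)).sum()

        print(torch.allclose(pred_gd, pred_attn))  # True: the two computations coincide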
    Preview abstract In this work we propose a HyperTransformer, a Transformer-based model for supervised and semi-supervised few-shot learning that generates weights of a convolutional neural network (CNN) directly from support samples. Since the dependence of a small generated CNN model on a specific task is encoded by a high-capacity Transformer model, we effectively decouple the complexity of the large task space from the complexity of individual tasks. Our method is particularly effective for small target CNN architectures where learning a fixed universal task-independent embedding is not optimal and better performance is attained when the information about the task can modulate all model parameters. For larger models we discover that generating the last layer alone allows us to produce competitive or better results than those obtained with state-of-the-art methods while being end-to-end differentiable. View details
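    A rough sketch, under my own simplifying assumptions, of the central mechanism: a Transformer encoder reads the support set and emits the weights of the target classifier (here only a last-layer weight matrix, echoing the observation about larger models), which is then applied to query features.

        import torch
        import torch.nn as nn

        class WeightGenerator(nn.Module):
            def __init__(self, feat_dim=64, n_classes=5):
                super().__init__()
                layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
                self.encoder = nn.TransformerEncoder(layer, num_layers=2)
                self.to_weights = nn.Linear(feat_dim, feat_dim)  # one weight row per class
                self.n_classes = n_classes

            def forward(self, support_feats, support_labels):
                # support_feats: (N, feat_dim); encode the episode, pool per class into weight rows
                enc = self.encoder(support_feats.unsqueeze(0)).squeeze(0)
                rows = [self.to_weights(enc[support_labels == c].mean(0)) for c in range(self.n_classes)]
                return torch.stack(rows)                 # (n_classes, feat_dim): generated last layer

        gen = WeightGenerator()
        support = torch.randn(25, 64)                    # pre-extracted features, 5-way 5-shot
        labels = torch.arange(5).repeat_interleave(5)
        W = gen(support, labels)
        query_logits = torch.randn(10, 64) @ W.T         # generated head classifies query features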
    Abstract: In this paper we propose augmenting Vision Transformer models with learnable memory tokens. Our approach allows the model to adapt to new tasks using few parameters, while optionally preserving its capabilities on previously learned tasks. At each layer we introduce a set of learnable embedding vectors that provide contextual information useful for specific datasets; we call these "memory tokens". We show that augmenting a model with just a handful of such tokens per layer significantly improves accuracy compared to conventional head-only fine-tuning, and performs only slightly below the significantly more expensive full fine-tuning. We then propose an attention-masking approach that enables extension to new downstream tasks while reusing computation. In this setup, in addition to being parameter-efficient, models can execute both old and new tasks as part of a single inference pass at a small incremental cost.
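    A simplified sketch of one way such memory tokens can enter self-attention (the exact placement and masking are my assumptions): a few trainable embeddings are appended to the key/value sequence of a layer, so a frozen backbone can be conditioned on a new dataset with very few added parameters.

        import torch
        import torch.nn as nn

        class AttentionWithMemory(nn.Module):
            def __init__(self, dim=64, heads=4, n_mem=4):
                super().__init__()
                self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.memory = nn.Parameter(torch.randn(1, n_mem, dim) * 0.02)  # learnable memory tokens

            def forward(self, tokens):                   # tokens: (batch, seq, dim)
                mem = self.memory.expand(tokens.size(0), -1, -1)
                kv = torch.cat([tokens, mem], dim=1)     # memory joins the keys/values only
                out, _ = self.attn(tokens, kv, kv)       # queries are the original tokens
                return out

        layer = AttentionWithMemory()
        patch_tokens = torch.randn(2, 197, 64)           # e.g. ViT patch + class tokens
        adapted = layer(patch_tokens)                    # same shape, now memory-conditioned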
    Abstract: Conditional computation and modular networks have recently been proposed for multi-task learning and other problems as a way to decompose problem solving into multiple reusable computational blocks. We propose a new approach for learning modular networks based on the isometric version of ResNet, with all residual blocks having the same configuration and the same number of parameters. This architectural choice allows adding, removing and changing the order of residual blocks. In our method, the modules can be invoked repeatedly and allow knowledge transfer to novel tasks by adjusting the order of computation. This allows soft weight sharing between tasks with only a small increase in the number of parameters. We show that our method leads to interpretable self-organization of modules in multi-task learning, transfer learning and domain adaptation, while achieving competitive results on those tasks. From a practical perspective, our approach allows us to: (a) reuse existing modules for learning a new task by adjusting the computation order, (b) perform unsupervised multi-source domain adaptation, illustrating that adaptation to unseen data can be achieved by only manipulating the order of pretrained modules, and (c) increase the accuracy of existing architectures on image classification tasks such as ImageNet, without any parameter increase, by reusing the same block multiple times.
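    A toy sketch (not the paper's implementation) of the reusable-module idea: because all residual blocks share one configuration and shape, a task can be defined simply by the order in which shared modules are invoked.

        import torch
        import torch.nn as nn

        class ResBlock(nn.Module):
            def __init__(self, dim=32):
                super().__init__()
                self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

            def forward(self, x):
                return x + self.body(x)                  # identical shape in and out, so order is free

        modules = nn.ModuleList([ResBlock() for _ in range(4)])                 # shared pool of blocks
        task_orders = {"task_a": [0, 1, 2, 3], "task_b": [2, 0, 0, 3, 1]}       # re-use and re-order

        def run(x, task):
            for idx in task_orders[task]:
                x = modules[idx](x)
            return x

        x = torch.randn(8, 32)
        out_a, out_b = run(x, "task_a"), run(x, "task_b")  # same weights, different computation paths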
    Abstract: In this paper, we introduce a new type of generalized neural network in which neurons and synapses maintain multiple states. We show that classical gradient-based backpropagation can be seen as a special case of a two-state network where one state is used for activations and another for gradients, with update rules derived from the chain rule. In our generalized framework, networks have no explicit notion of gradients and never receive them. Instead, synapses and neurons are updated using a bidirectional Hebb-style update rule parameterized by a shared low-dimensional "genome". We show that such genomes can be meta-learned from scratch, using either conventional optimization techniques or evolution strategies such as CMA-ES. The resulting update rules generalize to unseen tasks and train faster than gradient-descent-based optimizers on several standard computer vision and synthetic tasks.
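    A loose sketch, with the specific update form being my own simplification, of a multi-state synapse updated by a Hebb-style rule whose coefficients come from a small shared "genome" rather than from explicit gradients; in the meta-learning setting, the genome itself would be optimized (e.g. by CMA-ES) across tasks.

        import torch

        n_states, d_in, d_out = 2, 8, 4
        genome = torch.randn(n_states, n_states) * 0.01   # shared low-dimensional update parameters
        W = torch.randn(n_states, d_in, d_out) * 0.1      # multi-state synapses
        x_states = torch.randn(n_states, d_in)            # multi-state neuron activations

        # forward pass: each state propagates through its own synapse slice
        y_states = torch.einsum('si,sio->so', x_states, W)

        # Hebb-style update: pre/post states are mixed through the genome, no gradients involved
        W = W + torch.einsum('ab,ai,bo->aio', genome, x_states, y_states)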
    BasisNet: Two-Stage Model Synthesis for Efficient Inference
    Chun-Te Chu
    Andrew Howard
    Yukun Zhu
    Rebecca Hwa
    Adriana Kovashka
    CVPR Workshop on Efficient Deep Learning for Computer Vision (ECV) (2021)
    Abstract: In this work, we present BasisNet, which combines recent advances in efficient neural network architectures, conditional computation, and early termination in a simple new form. Our approach incorporates a lightweight model to preview the input and generate input-dependent combination coefficients, which later control the synthesis of a more accurate specialist model that makes the final prediction. The two-stage model synthesis strategy can be applied to any network architecture, and both stages are trained jointly. We also show that proper training recipes are critical for increasing the generalizability of such high-capacity neural networks. On the ImageNet classification benchmark, BasisNet with MobileNets as the backbone demonstrated a clear advantage in the accuracy-efficiency trade-off over several strong baselines. Specifically, BasisNet-MobileNetV3 obtained 80.3% top-1 accuracy with only 290M Multiply-Add operations, halving the computational cost of the previous state of the art without sacrificing accuracy. With early termination, the average cost can be further reduced to 198M MAdds while maintaining 80.0% top-1 accuracy on ImageNet.
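    A condensed sketch of the two-stage idea (the layer below is hypothetical, not the released model): a lightweight preview model emits per-input coefficients that mix a set of basis kernels into one input-specific convolution.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class BasisConv(nn.Module):
            def __init__(self, n_basis=4, c_in=16, c_out=16, k=3):
                super().__init__()
                self.bases = nn.Parameter(torch.randn(n_basis, c_out, c_in, k, k) * 0.05)

            def forward(self, x, coeffs):                # coeffs: (n_basis,) from the preview model
                kernel = torch.einsum('b,boihw->oihw', coeffs, self.bases)  # synthesized specialist kernel
                return F.conv2d(x, kernel, padding=1)

        preview = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                nn.Linear(16, 4), nn.Softmax(dim=-1))       # lightweight first stage
        layer = BasisConv()
        x = torch.randn(1, 16, 32, 32)
        coeffs = preview(x)[0]                           # input-dependent combination coefficients
        y = layer(x, coeffs)                             # specialist conv synthesized for this input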
    Abstract: In this paper we introduce a novel method that enables parameter-efficient transfer and multi-task learning. We show that by reusing more than 95% of the parameters we can re-purpose neural networks to solve very different types of problems, such as going from COCO SSD detection to ImageNet classification. Our approach allows both simultaneous (e.g. multi-task) learning and sequential fine-tuning, where we change an already trained network to solve a different problem. We show that our approach leads to a significant increase in accuracy compared to traditional logits-only fine-tuning while using far fewer parameters. Interestingly, for multi-task learning our approach sometimes acts as a regularizer, often leading to improved performance compared to models trained on a single task. Our approach has multiple immediate applications. It can be used to dramatically increase the number of models available in resource-constrained settings, since the marginal cost of a new model is now less than 5% of the full model. The constrained fine-tuning also enables better generalization when only a limited amount of data is available. We evaluate our approach on multiple datasets and multiple models.
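    A brief sketch of one plausible instantiation of constrained fine-tuning; the choice of which small subset to re-train (normalization parameters plus the classifier head) and the use of torchvision's MobileNetV2 as the backbone are my assumptions, not the paper's exact recipe.

        import torch.nn as nn
        import torchvision

        model = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1")   # pretrained backbone
        for p in model.parameters():
            p.requires_grad = False                      # reuse the vast majority of the weights as-is

        for m in model.modules():                        # small trainable "patch": BatchNorm affine params
            if isinstance(m, nn.BatchNorm2d):
                m.weight.requires_grad = True
                m.bias.requires_grad = True

        model.classifier[1] = nn.Linear(model.last_channel, 10)   # new task head (trainable by default)

        trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
        total = sum(p.numel() for p in model.parameters())
        print(f"trainable fraction: {trainable / total:.2%}")     # only a few percent of the model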
    Abstract: We explore the question of how the resolution of the input image affects the performance of a neural network compared to the resolution of its hidden layers. Image resolution is frequently used as a hyperparameter providing a trade-off between model performance and accuracy. An intuitive interpretation is that the decay in accuracy when input resolution is reduced is caused by the reduced information content of the low-resolution input. What is often left unsaid is that this also reduces the model's internal resolution. In this paper, we show that up to a point the input resolution plays very little role in network performance. We also show that another obvious hypothesis, changes in receptive fields, is not the primary root cause either. We then use this insight to develop novel neural network architectures that we call isometric neural networks, which maintain a fixed internal resolution throughout their entire depth, and demonstrate that they lead to high-accuracy models with a low activation footprint and parameter count.
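    A small sketch of my reading of an "isometric" design (not the exact published architecture): the input is rearranged to a fixed low internal resolution up front via space-to-depth, and every subsequent block keeps that resolution and width unchanged.

        import torch
        import torch.nn as nn

        class IsoBlock(nn.Module):
            def __init__(self, c):
                super().__init__()
                self.conv = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
                                          nn.Conv2d(c, c, 3, padding=1))

            def forward(self, x):
                return x + self.conv(x)                  # resolution and channel count never change

        class IsoNet(nn.Module):
            def __init__(self, blocks=4, c=3 * 16 * 16, n_classes=1000):
                super().__init__()
                self.to_low_res = nn.PixelUnshuffle(16)  # space-to-depth: 224x224x3 -> 14x14x768
                self.body = nn.Sequential(*[IsoBlock(c) for _ in range(blocks)])
                self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(c, n_classes))

            def forward(self, x):
                return self.head(self.body(self.to_low_res(x)))

        logits = IsoNet()(torch.randn(1, 3, 224, 224))   # internal resolution fixed at 14x14 throughout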
    Abstract: We propose a new method for learning image attention masks in a semi-supervised setting based on the Information Bottleneck principle. Provided with a set of labeled images, the mask generation model minimizes the mutual information between the input and the masked image while maximizing the mutual information between the same masked image and the image label. In contrast with other approaches, our attention model produces a boolean rather than a continuous mask, thus entirely concealing the information in masked-out pixels. Using a set of synthetic datasets based on MNIST and CIFAR-10, as well as the SVHN dataset, we demonstrate that our method can successfully attend to features defining the image class. We also discuss potential drawbacks of our method and propose a mask randomization technique to alleviate one of them.
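    A loose sketch of one training step under this objective; the straight-through Bernoulli mask and the surrogate losses (cross-entropy standing in for the label term, a mask-size penalty standing in for limiting information about the input) are my assumptions, not the paper's exact estimators.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        mask_gen = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(8, 1, 3, padding=1))            # per-pixel mask logits
        classifier = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
        opt = torch.optim.Adam(list(mask_gen.parameters()) + list(classifier.parameters()), lr=1e-3)

        x = torch.rand(32, 1, 28, 28)                    # MNIST-like batch
        y = torch.randint(0, 10, (32,))

        probs = torch.sigmoid(mask_gen(x))
        hard = (torch.rand_like(probs) < probs).float()
        mask = hard + probs - probs.detach()             # straight-through boolean mask
        logits = classifier(x * mask)                    # masked-out pixels carry no information
        loss = F.cross_entropy(logits, y) + 0.1 * probs.mean()   # label term + bottleneck surrogate
        opt.zero_grad()
        loss.backward()
        opt.step()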