Max Vladymyrov

Authored Publications
    Transformers learn in-context by gradient descent
    Johannes von Oswald
    João Sacramento
    International Conference on Machine Learning (2023), pp. 35151-35174
    Transformers have become the state-of-the-art neural network architecture across numerous domains of machine learning. This is partly due to their celebrated ability to transfer and to learn in-context based on a few examples. Nevertheless, the mechanism of why and how Transformers become in-context learners is not well understood and remains mostly an intuition. Here, we argue that training Transformers on auto-regressive tasks can be closely related to well-known gradient-based meta-learning formulations. We do so by providing a simple construction that shows the equivalence of data transformations induced by 1) a single linear self-attention layer and by 2) gradient descent on a regression loss. Motivated by that construction, we show empirically that when training self-attention-only Transformers on simple regression tasks, either the models learned by GD and Transformers show great similarity or, remarkably, the solutions found by gradient descent converge in weight space to our construction. This allows us, at least on our simple regression tasks, to mechanistically understand the inner workings of Transformers that enable in-context learning. Finally, we discuss intriguing parallels to a mechanism identified as crucial for in-context learning, termed induction-head (Olsson et al., 2022), and show how it could be generalized by in-context learning by gradient descent within Transformers.
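The correspondence described in the abstract above can be illustrated numerically in its simplest form. The following is a minimal sketch under stated assumptions, not the paper's construction: it assumes scalar targets, zero initial weights, and a mean squared-error loss, and the function names and learning rate are hypothetical. With those assumptions, the prediction after one gradient-descent step equals a softmax-free attention readout in which the context inputs act as keys, the targets as values, and the test input as the query.

```python
import numpy as np


def gd_step_prediction(X, y, x_query, lr):
    """Predict at x_query after one gradient-descent step from w = 0
    on the loss L(w) = 1/(2N) * sum_i (w @ x_i - y_i)^2."""
    n = X.shape[0]
    w0 = np.zeros(X.shape[1])        # assumption: zero initial weights
    grad = (X @ w0 - y) @ X / n      # gradient of L with respect to w
    w1 = w0 - lr * grad
    return w1 @ x_query


def linear_attention_prediction(X, y, x_query, lr):
    """Softmax-free attention readout: keys x_i, values y_i, query x_query."""
    n = X.shape[0]
    scores = X @ x_query             # unnormalized key-query dot products
    return lr / n * (y @ scores)     # score-weighted sum of the values


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # 8 in-context examples, 3 features
y = X @ rng.normal(size=3)           # targets from a random linear teacher
x_query = rng.normal(size=3)

pred_gd = gd_step_prediction(X, y, x_query, lr=0.1)
pred_attn = linear_attention_prediction(X, y, x_query, lr=0.1)
print(np.isclose(pred_gd, pred_attn))
```

Both functions compute (lr / N) * sum_i y_i (x_i . x_query), so they agree up to floating-point error. A trained Transformer layer would carry learned projection matrices, which this sketch leaves out.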
    The architecture and the parameters of neural networks are often optimized independently, which requires costly retraining of the parameters whenever the architecture is modified. In this work we instead focus on growing the architecture without requiring costly retraining. We present a method that adds new neurons during training without impacting what is already learned, while improving the training dynamics. We achieve the latter by maximizing the gradients of the new weights and find the optimal initialization efficiently by means of the singular value decomposition (SVD). We call this technique Gradient Maximizing Growth (GradMax) and demonstrate its effectiveness in a variety of vision tasks and architectures.
    In this paper we propose augmenting Vision Transformer models with learnable memory tokens. Our approach allows the model to adapt to new tasks, using few parameters, while optionally preserving its capabilities on previously learned tasks. At each layer we introduce a set of learnable embedding vectors that provide contextual information useful for specific datasets. We call these "memory tokens". We show that augmenting a model with just a handful of such tokens per layer significantly improves accuracy when compared to conventional head-only fine-tuning, and performs only slightly below the significantly more expensive full fine-tuning. We then propose an attention-masking approach that enables extension to new downstream tasks with computation reuse. In this setup, in addition to being parameter-efficient, models can execute both old and new tasks as part of a single inference at a small incremental cost.
    In this work we propose a HyperTransformer, a Transformer-based model for supervised and semi-supervised few-shot learning that generates weights of a convolutional neural network (CNN) directly from support samples. Since the dependence of a small generated CNN model on a specific task is encoded by a high-capacity Transformer model, we effectively decouple the complexity of the large task space from the complexity of individual tasks. Our method is particularly effective for small target CNN architectures, where learning a fixed universal task-independent embedding is not optimal and better performance is attained when the information about the task can modulate all model parameters. For larger models, we discover that generating the last layer alone allows us to produce competitive or better results than those obtained with state-of-the-art methods while being end-to-end differentiable.
    In this paper, we introduce a new type of generalized neural network where neurons and synapses maintain multiple states. We show that classical gradient-based backpropagation in neural networks can be seen as a special case of a two-state network where one state is used for activations and another for gradients, with update rules derived from the chain rule. In our generalized framework, networks neither have an explicit notion of gradients nor ever receive them. The synapses and neurons are updated using a bidirectional Hebb-style update rule parameterized by a shared low-dimensional "genome". We show that such genomes can be meta-learned from scratch, using either conventional optimization techniques or evolutionary strategies, such as CMA-ES. The resulting update rules generalize to unseen tasks and train faster than gradient-descent-based optimizers for several standard computer vision and synthetic tasks.
    ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
    No Pressure! Addressing Problem of Local Minima in Manifold Learning
    33rd Annual Conference on Neural Information Processing Systems (2019)
    Nonlinear embedding manifold learning methods provide invaluable visual insight into the structure of high-dimensional data. However, due to a complicated nonlinear objective function, these methods can easily get stuck in local minima, and their embedding quality can be poor. We propose a natural extension to several manifold learning methods aimed at identifying pressured points, i.e., points that are stuck in poor local minima and have poor embedding quality. We show that the pressure can be decreased by temporarily allowing these points to make use of an extra dimension in the embedding space. In the evaluation, we show that our method is able to improve the objective function value of existing methods even after they get stuck in a poor local minimum.