Matthias Minderer
I am a Research Scientist at Google Brain. I’m interested in how neural representations of the world should be structured to make them learnable from sensory data with little supervision.
Before joining Google, I obtained a PhD in systems neuroscience in Christopher Harvey’s lab at Harvard. There, I studied how visual and action-related information is represented and distributed in the cortex.
Authored Publications
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Carlos Riquelme
Sebastian Goodman
Yi Tay
Siamak Shakeri
Daniel Salz
Michael Tschannen
Mandar Joshi
Filip Pavetić
Gang Li
Anurag Arnab
Yuanzhong Xu
Keran Rong
Neil Houlsby
Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Abstract
We explore the boundaries of scaling up a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state of the art on most vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
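The abstract does not spell out the architecture, but the general pattern behind models of this kind, visual tokens from an image encoder combined with text tokens and consumed by a text decoder, can be sketched as follows. This is a minimal, illustrative PyTorch sketch with made-up dimensions, depths, and tokenization; it is not the PaLI-X implementation.

```python
import torch
import torch.nn as nn

class ToyVisionLanguageModel(nn.Module):
    """Illustrative only: image-patch tokens from a small encoder are
    concatenated with embedded prompt tokens, and a text decoder predicts
    the output sequence (caption, answer, etc.)."""

    def __init__(self, d_model=512, vocab_size=32000, patch_dim=3 * 16 * 16):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)  # flattened 16x16 RGB patches
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, prompt_ids, target_ids):
        visual_tokens = self.image_encoder(self.patch_embed(patches))
        multimodal_context = torch.cat(
            [visual_tokens, self.text_embed(prompt_ids)], dim=1)
        decoded = self.decoder(self.text_embed(target_ids), multimodal_context)
        return self.lm_head(decoded)  # next-token logits

model = ToyVisionLanguageModel()
logits = model(torch.rand(2, 196, 768),           # 196 flattened image patches
               torch.randint(0, 32000, (2, 8)),   # prompt token ids
               torch.randint(0, 32000, (2, 8)))   # target token ids
```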
Scaling Vision Transformers to 22 Billion Parameters
Josip Djolonga
Basil Mustafa
Piotr Padlewski
Justin Gilmer
Mathilde Caron
Rodolphe Jenatton
Michael Tschannen
Anurag Arnab
Carlos Riquelme
Gamaleldin Elsayed
Fisher Yu
Avital Oliver
Fantine Huot
Mark Collier
Vighnesh Birodkar
Yi Tay
Filip Pavetić
Thomas Kipf
Neil Houlsby
arXiv (2023)
Abstract
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modeling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters. We present a recipe for highly efficient training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between bias and performance, an improved alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
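As a concrete illustration of the "lightweight linear model on frozen features" evaluation mentioned above, here is a minimal sketch. The features are random placeholders standing in for embeddings extracted from a frozen ViT backbone; nothing here reproduces the paper's actual setup or numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder features: in practice these would be pre-logits embeddings
# extracted once from a frozen backbone, with the backbone never updated.
rng = np.random.default_rng(0)
train_features = rng.normal(size=(1000, 1024))   # (num_images, feature_dim)
train_labels = rng.integers(0, 10, size=1000)
test_features = rng.normal(size=(200, 1024))
test_labels = rng.integers(0, 10, size=200)

# The "lightweight linear model": a logistic-regression probe on frozen features.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_features, train_labels)
print("linear-probe accuracy:", probe.score(test_features, test_labels))
```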
Decoder Denoising Pretraining for Semantic Segmentation
Emmanuel Asiedu Brempong
Simon Kornblith
Ting Chen
Niki Parmar
Mohammad Norouzi
Transactions on Machine Learning Research (2022)
Abstract
Semantic segmentation labels are expensive and time-consuming to acquire. Hence, pretraining is commonly used to improve the label-efficiency of segmentation models. Typically, the encoder of a segmentation model is pretrained as a classifier and the decoder is randomly initialized. Here, we argue that random initialization of the decoder can be suboptimal, especially when few labeled examples are available. We propose a decoder pretraining approach based on denoising, which can be combined with supervised pretraining of the encoder. We find that decoder denoising pretraining on the ImageNet dataset strongly outperforms encoder-only supervised pretraining. Despite its simplicity, decoder denoising pretraining achieves state-of-the-art results on label-efficient semantic segmentation and offers considerable gains on the Cityscapes, Pascal Context, and ADE20K datasets.
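A minimal sketch of the idea, assuming a toy encoder/decoder and additive Gaussian noise: corrupt the input, pass it through the (possibly supervised-pretrained) encoder, and train the decoder to predict the noise. The modules, noise scale, and optimization details below are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Toy stand-ins; the real method uses a full segmentation encoder-decoder
# and ImageNet images for the denoising pretraining stage.
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
decoder = nn.Sequential(nn.Upsample(scale_factor=2), nn.Conv2d(16, 3, 3, padding=1))
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)  # only the decoder is pretrained here

images = torch.rand(8, 3, 64, 64)          # placeholder batch
sigma = 0.4                                # noise scale (a tunable choice)
noise = torch.randn_like(images)
noisy = images + sigma * noise             # corrupt the input

features = encoder(noisy)                  # encoder may come from supervised pretraining
pred_noise = decoder(features)             # decoder learns to predict the added noise
loss = ((pred_noise - noise) ** 2).mean()  # denoising objective
loss.backward()
optimizer.step()
```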
Simple Open-Vocabulary Object Detection with Vision Transformers
Austin Stone
Maxim Neumann
Dirk Weissenborn
Alexey Dosovitskiy
Anurag Arnab
Zhuoran Shen
Thomas Kipf
Neil Houlsby
ECCV (Poster) (2022)
Abstract
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub (https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit).
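At inference time, text-conditioned ("open-vocabulary") detection of this kind reduces to scoring per-box image embeddings against text-query embeddings, for example by cosine similarity. The sketch below uses random placeholder embeddings to show only that scoring step; the OWL-ViT code linked above is the authoritative implementation.

```python
import numpy as np

def cosine_scores(box_embeddings, text_embeddings):
    """Score each predicted box against each free-text query by cosine similarity.
    In the real model, both sets of embeddings come from the contrastively
    pre-trained image and text towers."""
    b = box_embeddings / np.linalg.norm(box_embeddings, axis=-1, keepdims=True)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=-1, keepdims=True)
    return b @ t.T  # (num_boxes, num_queries)

rng = np.random.default_rng(0)
box_emb = rng.normal(size=(100, 512))   # one embedding per predicted box
query_emb = rng.normal(size=(3, 512))   # e.g. "a cat", "a red ball", "a bicycle"
scores = cosine_scores(box_emb, query_emb)
best_box_per_query = scores.argmax(axis=0)
```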
On Robustness and Transferability of Convolutional Neural Networks
Josip Djolonga
Jessica Yung
Michael Tschannen
Rob Romijnders
Dan Moldovan
Sylvain Gelly
Neil Houlsby
Conference on Computer Vision and Pattern Recognition (2021)
Abstract
Modern deep convolutional networks (CNNs) are often criticized for their failure to generalize under distributional shifts. However, several recent breakthroughs in transfer learning suggest that these networks can cope with severe distribution shifts and successfully adapt to new tasks from a few training examples. In this work we revisit the out-of-distribution and transfer performance of modern image classification CNNs and investigate the impact of the pre-training data scale, the model scale, and the data preprocessing pipeline. We find that increasing both the training set and model sizes significantly improves robustness to distribution shifts. Furthermore, we show that, perhaps surprisingly, simple changes in preprocessing, such as modifying the image resolution, can significantly mitigate robustness issues in some cases. Finally, we outline the shortcomings of existing robustness evaluation datasets and introduce a synthetic dataset for fine-grained robustness analysis.
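The resolution effect mentioned above can be probed with a simple evaluation sweep: resize the same inputs to several resolutions and compare the model's predictions. The toy model and random inputs below only illustrate the pattern; they are not the paper's evaluation protocol.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy classifier standing in for a pretrained CNN; adaptive pooling lets it
# accept any input resolution, as most ImageNet classifiers do.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
model.eval()

images = torch.rand(4, 3, 224, 224)   # placeholder evaluation batch
for resolution in (128, 224, 320):    # sweep the preprocessing resolution
    resized = F.interpolate(images, size=resolution, mode="bilinear", antialias=True)
    with torch.no_grad():
        logits = model(resized)
    print(resolution, logits.argmax(dim=1).tolist())
```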
Revisiting the Calibration of Modern Neural Networks
Josip Djolonga
Rob Romijnders
Frances Ann Hubis
Neil Houlsby
Neural Information Processing Systems (2021) (to appear)
Abstract
Accurate estimation of predictive uncertainty (model calibration) is essential for the safe application of neural networks. Many instances of miscalibration in modern neural networks have been reported, suggesting a trend that newer, more accurate models produce poorly calibrated predictions. Here, we revisit this question for recent state-of-the-art image classification models. We systematically relate model calibration and accuracy, and find that the most recent models, notably those not using convolutions, are among the best calibrated. Trends observed in prior model generations, such as decay of calibration with distribution shift or model size, are less pronounced in recent architectures. We also show that model size and amount of pretraining do not fully explain these differences, suggesting that architecture is a major determinant of calibration properties.
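For reference, calibration is commonly quantified with the binned Expected Calibration Error (ECE); a minimal NumPy version is sketched below. The binning scheme and confidence definition are standard choices and may differ from the paper's exact setup.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=15):
    """Binned ECE: the average gap between confidence and accuracy,
    weighted by the fraction of predictions falling in each bin."""
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Tiny example: confidence of the predicted class and whether it was correct.
conf = np.array([0.9, 0.8, 0.95, 0.6, 0.7])
correct = np.array([1, 1, 0, 1, 0], dtype=float)
print(expected_calibration_error(conf, correct))
```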
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Dirk Weissenborn
Jakob Uszkoreit
Neil Houlsby
Sylvain Gelly
Thomas Unterthiner
ICLR (2021)
Abstract
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision tasks, attention is usually either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on ConvNets is not necessary and that a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. When pre-trained on large amounts of data and transferred to multiple recognition benchmarks (ImageNet, CIFAR-10, etc.), these transformers attain excellent accuracy, matching or outperforming the best convolutional networks while requiring substantially fewer computational resources to train.
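The "16x16 words" in the title refers to the input pipeline: the image is cut into fixed-size patches, each patch is flattened and linearly projected, and the resulting token sequence (plus a class token and position embeddings) is fed to a standard Transformer encoder. A minimal PyTorch sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

patch_size, d_model = 16, 768
image = torch.rand(1, 3, 224, 224)

# Cut the image into non-overlapping 16x16 patches and flatten each one.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
# patches: (1, 196, 768) -- a "sentence" of 196 patch tokens

# Linear projection of flattened patches, plus a learnable class token.
project = nn.Linear(3 * patch_size * patch_size, d_model)
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
tokens = torch.cat([cls_token, project(patches)], dim=1)  # (1, 197, d_model)
# `tokens` (with position embeddings added) is what the Transformer encoder consumes.
```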
Unsupervised Learning of Object Structure and Dynamics from Videos
Neural Information Processing Systems (2019)
Abstract
Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning. To address this challenge, we adopt a keypoint-based image representation and learn a stochastic dynamics model of the keypoints. Future frames are reconstructed from the keypoints and a reference frame. By modeling dynamics in the keypoint coordinate space, we achieve stable learning and avoid compounding of errors in pixel space. Our method improves upon unstructured representations both for pixel-level video prediction and for downstream tasks requiring object-level understanding of motion dynamics. We evaluate our model on diverse datasets: a multi-agent sports dataset, the Human3.6M dataset, and datasets based on continuous control tasks from the DeepMind Control Suite. The spatially structured representation outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition, and reward prediction.
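One common way to make keypoint coordinates usable by an image decoder is to render them as Gaussian heatmaps, one channel per keypoint; a minimal sketch is below. This is a generic illustration of that keypoint-to-image-space step, not the paper's exact reconstruction model.

```python
import numpy as np

def keypoints_to_maps(keypoints, height=64, width=64, sigma=2.0):
    """Render (x, y) keypoints as Gaussian heatmaps of shape
    (num_keypoints, height, width), which a decoder can combine
    with a reference frame to reconstruct future frames."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = []
    for x, y in keypoints:
        maps.append(np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2)))
    return np.stack(maps)

# Two keypoints, e.g. tracked parts of a moving object.
maps = keypoints_to_maps([(16.0, 20.0), (40.5, 33.0)])
print(maps.shape)  # (2, 64, 64)
```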