Robert Geirhos
Robert is a Research Scientist at Google DeepMind, based in Toronto. He previously worked in the Wichmann, Bethge, and Brendel labs at the University of Tübingen and the International Max Planck Research School for Intelligent Systems. Robert's work focuses on advancing the understanding of robustness, interpretability, and human-machine comparisons in deep learning.
For more information, see Robert's personal website.
Authored Publications
Scaling Vision Transformers to 22 Billion Parameters
Josip Djolonga
Basil Mustafa
Piotr Padlewski
Justin Gilmer
Mathilde Caron
Rodolphe Jenatton
Michael Tschannen
Anurag Arnab
Carlos Riquelme
Gamaleldin Elsayed
Fisher Yu
Avital Oliver
Fantine Huot
Mark Collier
Vighnesh Birodkar
Yi Tay
Filip Pavetić
Thomas Kipf
Neil Houlsby
arXiv (2023)
Abstract
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modeling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters. We present a recipe for highly efficient training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between bias and performance, an improved alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
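The "lightweight linear model on frozen features" mentioned above refers to the common linear-probing setup: the pretrained backbone is kept frozen and only a linear classifier is fit on its embeddings. The sketch below illustrates that general idea; the random features and the 10-class task are placeholders, not the ViT-22B pipeline or its data.

```python
# Minimal sketch of linear probing on frozen features (illustrative only).
# The "frozen features" below are random placeholders standing in for
# embeddings produced by a pretrained, frozen vision backbone.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder: pretend each image was encoded into a 256-dimensional
# embedding by a frozen backbone; labels come from a hypothetical 10-class task.
num_images, feature_dim, num_classes = 1000, 256, 10
features = rng.normal(size=(num_images, feature_dim))
labels = rng.integers(0, num_classes, size=num_images)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# The "lightweight linear model": only this classifier is trained;
# the backbone that produced the features stays untouched.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"linear-probe accuracy: {probe.score(X_test, y_test):.3f}")
```

With real frozen embeddings in place of the random placeholders, the probe's accuracy serves as the downstream-task score reported in evaluations of this kind.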
Abstract
Deep neural networks (DNNs) are machine learning algorithms that have revolutionized computer vision due to their remarkable successes in tasks like object classification and segmentation. The success of DNNs as computer vision algorithms has led to the suggestion that DNNs may also be good models of human visual perception. In this article, we review evidence regarding current DNNs as adequate behavioral models of human core object recognition. To this end, we argue that it is important to distinguish between statistical tools and computational models and to understand model quality as a multidimensional concept in which clarity about modeling goals is key. Reviewing a large number of psychophysical and computational explorations of core object recognition performance in humans and DNNs, we argue that DNNs are highly valuable scientific tools but that, as of today, DNNs should only be regarded as promising—but not yet adequate—computational models of human core object recognition behavior. On the way, we dispel several myths surrounding DNNs in vision science.
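Behavioral comparisons of the kind reviewed here often go beyond aggregate accuracy and ask whether humans and DNNs err on the same trials. The sketch below computes a chance-corrected trial-by-trial agreement score (in the spirit of error-consistency analyses); the response vectors are made up for illustration and do not reflect the specific analyses in the article.

```python
# Minimal sketch: chance-corrected trial-by-trial error agreement between a
# human observer and a DNN. All response data below are made up.
import numpy as np

rng = np.random.default_rng(1)
num_trials = 200

# True = correct response, False = error, on the same set of trials/images.
human_correct = rng.random(num_trials) < 0.85   # hypothetical 85% accuracy
model_correct = rng.random(num_trials) < 0.80   # hypothetical 80% accuracy

# Observed agreement: fraction of trials where both are right or both are wrong.
observed = np.mean(human_correct == model_correct)

# Agreement expected by chance from the two accuracies alone.
p_h, p_m = human_correct.mean(), model_correct.mean()
expected = p_h * p_m + (1 - p_h) * (1 - p_m)

# Kappa-style correction: 0 means agreement is explained by accuracy alone,
# values near 1 mean the two observers make errors on the same trials.
kappa = (observed - expected) / (1 - expected)
print(f"observed: {observed:.3f}, chance: {expected:.3f}, kappa: {kappa:.3f}")
```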