- Mostafa Dehghani
- Josip Djolonga
- Basil Mustafa
- Piotr Padlewski
- Jonathan Heek
- Justin Gilmer
- Andreas Steiner
- Mathilde Caron
- Robert Geirhos
- Ibrahim Alabdulmohsin
- Rodolphe Jenatton
- Lucas Beyer
- Michael Tschannen
- Anurag Arnab
- Xiao Wang
- Carlos Riquelme
- Matthias Minderer
- Joan Puigcerver
- Utku Evci
- Manoj Kumar
- Sjoerd van Steenkiste
- Gamaleldin Elsayed
- Aravindh Mahendran
- Fisher Yu
- Avital Oliver
- Fantine Huot
- Jasmijn Bastings
- Mark Collier
- Alexey Gritsenko
- Vighnesh Birodkar
- Cristina Vasconcelos
- Yi Tay
- Thomas Mensink
- Alexander Kolesnikov
- Filip Pavetić
- Dustin Tran
- Thomas Kipf
- Mario Lučić
- Xiaohua Zhai
- Daniel Keysers
- Jeremiah Harmsen
- Neil Houlsby
Abstract
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modeling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters. We present a recipe for highly efficient training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between bias and performance, an improved alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
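
The abstract mentions evaluating ViT-22B downstream with a lightweight linear model on frozen features (linear probing). As a rough illustration only, the sketch below fits such a probe with scikit-learn on synthetic stand-in arrays; the `features`, `labels`, and their dimensions are hypothetical placeholders, not data or code from the paper.

```python
# Minimal linear-probe sketch (illustrative; not the paper's evaluation code).
# `features` stands in for embeddings pre-extracted with a frozen ViT backbone;
# `labels` stands in for the downstream task's class ids. Both are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 1024)).astype(np.float32)  # N x D frozen embeddings (placeholder)
labels = rng.integers(0, 10, size=1000)                       # N class labels (placeholder)

# The backbone stays frozen; only this lightweight linear head is trained.
probe = LogisticRegression(max_iter=1000)
probe.fit(features, labels)
print("linear-probe accuracy on the fit split:", probe.score(features, labels))
```

In practice the probe is trained on features from a held-out training split and scored on a separate test split; the point of the sketch is simply that only the small linear head is optimized while the backbone's weights are left untouched.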