Scaling Vision Transformers to 22 Billion Parameters

Mostafa Dehghani; Josip Djolonga; Basil Mustafa; Piotr Padlewski; Jonathan Heek; Justin Gilmer; Andreas Steiner; Mathilde Caron; Robert Geirhos; Ibrahim Alabdulmohsin; Rodolphe Jenatton; Lucas Beyer; Michael Tschannen; Anurag Arnab; Xiao Wang; Carlos Riquelme; Matthias Minderer; Joan Puigcerver; Utku Evci; Manoj Kumar; Sjoerd van Steenkiste; Gamaleldin Elsayed; Aravindh Mahendran; Fisher Yu; Avital Oliver; Fantine Huot; Jasmijn Bastings; Mark Collier; Alexey Gritsenko; Vighnesh Birodkar; Cristina Vasconcelos; Yi Tay; Thomas Mensink; Alexander Kolesnikov; Filip Pavetić; Dustin Tran; Thomas Kipf; Mario Lučić; Xiaohua Zhai; Daniel Keysers; Jeremiah Harmsen; Neil Houlsby

Arxiv (2023)

Abstract

The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modeling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters. We present a recipe for highly efficient training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between bias and performance, improved alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
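The abstract mentions evaluating with "a lightweight linear model on frozen features". As a rough illustration of that general idea, and not the authors' actual evaluation protocol, the sketch below fits a closed-form ridge-regression probe on fixed feature vectors. Everything here is an assumption made for illustration: the function names `fit_linear_probe` and `predict` are hypothetical, the regularization strength is arbitrary, and the "features" are synthetic Gaussian clusters standing in for the outputs of a frozen backbone.

```python
import numpy as np


def fit_linear_probe(features, labels, num_classes, l2=1e-3):
    """Fit a linear probe by ridge regression onto one-hot targets.

    The backbone is never touched: only this weight matrix is learned.
    """
    n, d = features.shape
    x = np.hstack([features, np.ones((n, 1))])  # append a bias column
    y = np.eye(num_classes)[labels]             # one-hot targets, shape (n, C)
    # Solve (X^T X + l2 * I) W = X^T Y for W, shape (d + 1, C).
    return np.linalg.solve(x.T @ x + l2 * np.eye(d + 1), x.T @ y)


def predict(weights, features):
    """Return the class with the highest linear score for each row."""
    x = np.hstack([features, np.ones((features.shape[0], 1))])
    return np.argmax(x @ weights, axis=1)


# Synthetic stand-in for frozen features (assumption: no real model is used).
rng = np.random.default_rng(0)
num_classes, dim, per_class = 3, 16, 100
centers = rng.normal(scale=5.0, size=(num_classes, dim))
labels = np.repeat(np.arange(num_classes), per_class)
features = centers[labels] + rng.normal(size=(labels.size, dim))

# Hold out a third of the examples to measure probe accuracy.
perm = rng.permutation(labels.size)
train_idx, test_idx = perm[:200], perm[200:]

weights = fit_linear_probe(features[train_idx], labels[train_idx], num_classes)
accuracy = float(np.mean(predict(weights, features[test_idx]) == labels[test_idx]))
print(f"held-out probe accuracy: {accuracy:.3f}")
```

Because only the small weight matrix is trained, a probe like this is cheap to fit even when the backbone that produced the features is very large, which is what makes it a convenient way to compare representations across model scales.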
