MLP-Mixer: An All-MLP Architecture for Vision

Ilya Tolstikhin; Neil Houlsby; Alexander Kolesnikov; Lucas Beyer; Xiaohua Zhai; Thomas Unterthiner; Jessica Yung; Andreas Steiner; Daniel Martin Keysers; Jakob Uszkoreit; Mario Lučić; Alexey Dosovitskiy

NeurIPS 2021 (poster)

Abstract

Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them is necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well-established CNNs and Transformers.
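
The two layer types described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the image has already been split into patches and embedded as a table of shape (num_patches, channels), and the layer normalization, skip connections, GELU activation, hidden widths, and initialization scale are all assumptions of this sketch that the abstract does not specify. The helper names (`mlp`, `layer_norm`, `init_mlp`, `mixer_block`) are hypothetical.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU (the activation is an assumption of this sketch).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def mlp(x, w1, b1, w2, b2):
    # Two-layer perceptron acting on the last axis of x.
    return gelu(x @ w1 + b1) @ w2 + b2


def layer_norm(x, eps=1e-6):
    # Normalize over the last axis; learnable scale and offset omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def init_mlp(rng, dim, hidden):
    # Hypothetical initializer: small Gaussian weights, zero biases.
    return (
        rng.normal(0.0, 0.02, (dim, hidden)),
        np.zeros(hidden),
        rng.normal(0.0, 0.02, (hidden, dim)),
        np.zeros(dim),
    )


def mixer_block(x, token_params, channel_params):
    # x has shape (num_patches, channels).
    # MLP applied across patches ("mixing" spatial information):
    # transpose so the patch axis is last, apply the MLP, transpose back.
    y = layer_norm(x)
    x = x + mlp(y.T, *token_params).T
    # MLP applied independently to each patch ("mixing" per-location features).
    y = layer_norm(x)
    return x + mlp(y, *channel_params)


rng = np.random.default_rng(0)
num_patches, channels = 16, 8
x = rng.normal(size=(num_patches, channels))
token_params = init_mlp(rng, num_patches, 32)   # mixes along the patch axis
channel_params = init_mlp(rng, channels, 32)    # mixes along the channel axis
out = mixer_block(x, token_params, channel_params)
print(out.shape)
```

The only structural difference between the two MLPs is the axis they act on: the same `mlp` helper is applied to the transposed table for mixing across patches and to the table as-is for mixing within each patch, so the block preserves the input shape.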
