PaLI-X: On Scaling up a Multilingual Vision and Language Model

Xi Chen; Josip Djolonga; Piotr Padlewski; Basil Mustafa; Beer Changpinyo; Jialin Wu; Carlos Riquelme; Sebastian Goodman; Xiao Wang; Yi Tay; Siamak Shakeri; Mostafa Dehghani; Daniel Salz; Mario Lučić; Michael Tschannen; Arsha Nagrani; Hexiang (Frank) Hu; Mandar Joshi; Bo Pang; Ceslee Montgomery; Paulina Pietrzyk; Marvin Ritter; AJ Piergiovanni; Matthias Minderer; Filip Pavetić; Austin Waters; Gang Li; Ibrahim Alabdulmohsin; Lucas Beyer; Julien Amelot; Kenton Lee; Andreas Steiner; Yang Li; Daniel Keysers; Anurag Arnab; Yuanzhong Xu; Keran Rong; Alexander Kolesnikov; Mojtaba Seyedhosseini; Anelia Angelova; Xiaohua Zhai; Neil Houlsby; Radu Soricut

Computer Vision and Pattern Recognition Conference (CVPR) (2024)

Abstract

We explore the boundaries of scaling up a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state of the art on most of the vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
