Adam Grycner
Research Areas
Authored Publications
Sort By
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Piotr Padlewski
Daniel Salz
Sebastian Alexander Goodman
Basil Mustafa
Keran Rong
Hassan Akbari
Linting Xue
James Bradbury
Chao Jia
Carlos Riquelme
Neil Houlsby
International Conference on Learning Representations (ICLR) (2023)
Preview abstract
Effective scaling and a flexible task interface enable large-capacity language models to excel at many tasks. PaLI (Pathways Language and Image model) extends these ideas to the joint modeling of language and vision. PaLI is a model that generates text based on visual and textual inputs. Using this API, PaLI is able to perform many vision, language, and multimodal tasks, across many languages. We train PaLI with two main principles: reuse of pretrained unimodal components, and joint scaling of modalities. Using large-capacity pretrained language models and vision models allows us to capitalize on their existing capabilities, while leveraging the substantial cost of training them. We scale PaLI models across three axes:the language component, the vision component, and the training data that fuses them. For the vision component, we train the largest and best-performing VisionTransformer (ViT) to date. For the data, we build an image-text training set over10B images and covering over 100 languages.
PaLI inherits and enhances language-understanding capabilities, and achieves state-of-the-art in multiple vision and language tasks (image classification, image captioning, visual question-answering, scene-text understanding, etc.), based on a simple, modular, and reuse-friendly platform for modeling and scaling.
View details