Dynamic Pre-training of Vision-Language Models

AJ Piergiovanni; Weicheng Kuo; Wei Li; Anelia Angelova

ICLR 2023 Workshop on Multimodal Representation Learning (2023)

Abstract

Vision-language pretraining aims to learn universal cross-modal representations and to create models with broad capabilities. In this paper, we propose a novel dynamic resampling scheme for a variety of pretraining tasks. Unlike recent large-scale vision-language approaches, we show that a set of diverse self- and weakly-supervised pretraining tasks, dynamically sampled according to task difficulty, provides strong performance. Further, the approach is sample-efficient, using much less data and compute to address a range of downstream tasks. We show that a single 330M pretrained model, using only smaller and publicly accessible datasets, achieves competitive or SOTA performance on three diverse groups of tasks: visual question answering, text-based image localization by referring expressions, and video question answering.
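
To illustrate the idea of sampling pretraining tasks according to difficulty, the following is a minimal sketch and not the paper's implementation. It assumes that a task's recent training loss serves as a proxy for its difficulty and that sampling probabilities come from a softmax over those losses. The task names, the `temperature` parameter, and both function names are hypothetical and chosen only for illustration.

```python
import math
import random


def task_sampling_weights(recent_losses, temperature=1.0):
    """Map per-task losses to sampling probabilities with a softmax.

    Assumption: a higher recent loss means a harder task, so it should
    be sampled more often. The abstract does not specify this rule.
    """
    names = list(recent_losses)
    scaled = [recent_losses[name] / temperature for name in names]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


def sample_task(recent_losses, rng, temperature=1.0):
    """Draw one task name according to the difficulty-based weights."""
    weights = task_sampling_weights(recent_losses, temperature)
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=1)[0]


# Hypothetical running losses for three illustrative pretraining tasks.
losses = {"captioning": 2.0, "image_text_matching": 0.5, "masked_lm": 1.0}
weights = task_sampling_weights(losses)
next_task = sample_task(losses, random.Random(0))
```

In a training loop, the losses would be refreshed periodically (for example, with a moving average) so that the sampling distribution shifts as tasks become easier or harder. A larger `temperature` flattens the distribution toward uniform sampling.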
