All You May Need for VQA are Image Captions

Beer Changpinyo; Doron Kukliansky; Idan Szpektor; Xi Chen; Nan Ding; Radu Soricut

NAACL (2022)

Abstract

Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits and achieve a level of robustness that is lacking in the same model trained on human-annotated VQA data.
