Understanding Image and Text Simultaneously: a Dual Vision-Language Machine Comprehension Task

Nan Ding

Sebastian Goodman

Fei Sha

Radu Soricut

Arxiv, https://arxiv.org/abs/1612.07833 (2016)

Google Scholar

Abstract

We describe a new multi-modal task for computer systems, posed as a combined vision-language comprehension challenge: identify the most suitable \emph{text} describing a scene, given several similar options. Accomplishing the task entails demonstrating comprehension beyond just recognizing ``keywords'' (or key-phrases) and their corresponding visual concepts, and instead
requires an alignment between the representations of the two modalities that achieves a visually-grounded ``understanding'' of various linguistic elements and their dependencies. This new task also admits an easy-to-compute and well-understood metric: the accuracy in detecting the true target among the decoys.

The paper makes several contributions: a generic mechanism for generating decoys from (human-created) image captions; an instance of applying this mechanism, yielding a large-scale machine comprehension dataset (based on the COCO images and captions) that we make publicly available;
results on a human evaluation on this dataset, thus providing a performance ceiling; and several baseline and competitive learning approaches that illustrate the utility of the proposed framework in advancing both image and language machine comprehension. In particular, there is a large gap between human performance and state-of-the-art learning methods, suggesting a fruitful direction for future research.

Explore our many areas of focus

Building a collaborative ecosystem

Shaping the future together

Translating discovery into real-world impact

Understanding Image and Text Simultaneously: a Dual Vision-Language Machine Comprehension Task

Abstract

Research Areas

Meet the teams driving innovation

Google AI

Google Cloud

Google DeepMind

Google Labs