Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

Brian Gordon; Dani Lischinski; Daniel Cohen-Or; Idan Szpektor; Roopal Garg; Xi Chen; Yonatan Bitton

arXiv (2023)

Abstract

While existing image-text alignment models produce high-quality binary assessments, they fall short of pinpointing the exact source of misalignment.
In this paper, we present a method that provides detailed textual and visual explanations of detected misalignments between image-text pairs.
We leverage large language models to automatically construct a training set that holds plausible misaligned captions for a given image, along with corresponding textual explanations and visual indicators. We also introduce a new human-curated test set comprising ground-truth textual and visual misalignment annotations. Empirical results show that fine-tuning vision-language models on our training set enables them to articulate misalignments and visually indicate them within images, outperforming strong baselines on both the binary alignment classification and the explanation generation tasks.
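
To make the shape of such a training example concrete, here is a minimal sketch of how one record might be represented. The abstract does not specify a data format, so everything here is an assumption made for illustration: the class name FeedbackExample, its field names, the helper binary_label, and in particular the choice to model the visual indicator as a normalized bounding box.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumption: a visual indicator is a box (x_min, y_min, x_max, y_max)
# in coordinates normalized to [0, 1]. The abstract only says
# "visual indicators" and does not fix a representation.
Box = Tuple[float, float, float, float]


@dataclass
class FeedbackExample:
    """Hypothetical record for one image-caption pair."""
    image_id: str
    caption: str
    aligned: bool
    textual_explanation: Optional[str] = None  # why the caption mismatches
    visual_indicator: Optional[Box] = None     # where the mismatch is

    def __post_init__(self) -> None:
        has_feedback = (self.textual_explanation is not None
                        or self.visual_indicator is not None)
        if self.aligned and has_feedback:
            raise ValueError("aligned pairs carry no misalignment feedback")
        if not self.aligned and self.textual_explanation is None:
            raise ValueError("misaligned pairs need a textual explanation")
        if self.visual_indicator is not None:
            x0, y0, x1, y1 = self.visual_indicator
            if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
                raise ValueError("box must be normalized and non-empty")


def binary_label(example: FeedbackExample) -> int:
    """Target for the binary alignment task: 1 if aligned, else 0."""
    return int(example.aligned)
```

Under these assumptions, a misaligned pair carries both kinds of feedback, while an aligned pair carries neither. The same record can then serve both tasks named in the abstract: binary_label gives the classification target, and the two feedback fields give the explanation target.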

Research Areas

Natural language processing
