Context-aware Captions from Context-agnostic Supervision

Shanmukha Ramakrishna Vedantam
Samy Bengio
Devi Parikh
Gal Chechik


We describe a model to induce discriminative image captions based only on generative ground-truth training data. For example, given images and descriptions of “zebras” and “horses”, our system can generate discriminative language that describes the zebra images while capturing how they differ from the “horse” images. Producing discriminative language is a foundational problem in the study of pragmatic behavior: humans can effortlessly repurpose language to be persuasive and effective in communication. We first propose a novel inference procedure based on a reflex speaker and an introspector to induce discrimination between concepts. Intuitively, the reflex speaker models a good utterance for some concept (“zebra”), while the introspector models how discriminative the sentence is between the concepts (“zebra” and “horse”). Unlike previous approaches, the form of our listener has the attractive property of being amenable to joint approximate inference to select utterances that satisfy both the speaker and the introspector, yielding an introspective speaker. We apply our introspective speaker to the CUB-Text dataset, to describe why an image contains a particular bird category as opposed to some other closely related bird category, and to the MS COCO dataset, to generate language that points to one out of two semantically similar images. Evaluations with discriminative ground truth collected on CUB and with humans on MS COCO reveal that our approach outperforms baseline approaches for discrimination. We then draw qualitative insights from our model outputs, which suggest that in some cases one may interpret the introspective speaker outputs to be lies in service of the higher goal of discrimination.
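To make the interplay between the reflex speaker and the introspector concrete, the following is a minimal sketch of how a candidate utterance might be scored. The abstract does not give the scoring rule, so everything here is an assumption for illustration: the log-ratio form of the introspector, the trade-off weight `lam`, the function names, and the toy probabilities are all hypothetical.

```python
import math

def introspective_score(logp_target, logp_distractor, lam=0.5):
    """Combine a reflex-speaker term with an introspector term.

    Assumed form (not taken from the abstract):
      reflex       = log p(sentence | target concept)
      introspector = log p(sentence | target) - log p(sentence | distractor)
    lam = 1.0 recovers the plain, context-agnostic speaker.
    """
    reflex = logp_target
    introspector = logp_target - logp_distractor
    return lam * reflex + (1.0 - lam) * introspector

def pick_caption(candidates, lam=0.5):
    """Return the highest-scoring caption.

    candidates maps caption -> (log-prob under target, log-prob under distractor).
    """
    return max(candidates, key=lambda s: introspective_score(*candidates[s], lam=lam))

# Toy numbers: target concept "zebra", distractor concept "horse".
candidates = {
    "an animal standing in a field": (math.log(0.30), math.log(0.30)),
    "an animal with black and white stripes": (math.log(0.20), math.log(0.01)),
}

print(pick_caption(candidates, lam=1.0))  # speaker only: prefers the generic caption
print(pick_caption(candidates, lam=0.5))  # with introspection: prefers the striped caption
```

In this toy setting, the generic caption is more likely under the speaker alone, but it is equally likely for both concepts, so the introspector term contributes nothing to it. Reranking a fixed candidate set is a simplification; the joint approximate inference described above selects utterances during decoding rather than after it.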