Reinforcing an Image Caption Generator using Off-line Human Feedback

Paul Hongsuck Seo
Piyush Sharma
AAAI 2020

Abstract

Human ratings are currently the most accurate way to assess the quality of an image captioning model, yet most often the only outcome used from an expensive human rating evaluation is a few overall statistics over the evaluation dataset. In this paper, we show that the signal from instance-level human caption ratings can be leveraged to improve captioning models, even when the amount of caption ratings is several orders of magnitude smaller than the caption training data. We employ a policy gradient method to maximize the human ratings as rewards in an off-policy reinforcement learning setting, using a sampling distribution that focuses on the captions present in a caption-ratings dataset. We present empirical evidence that our models learn to generalize the human raters’ judgments in the caption-ratings training data to a previously unseen set of images, as judged by a different set of human judges, and additionally under a different, multi-dimensional side-by-side human evaluation procedure.
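
To make the off-policy objective concrete, the following Python sketch (a minimal illustration, not the authors' implementation; the function name, the baseline term, and the use of PyTorch are all assumptions) shows a REINFORCE-style loss with an importance weight that corrects for sampling rated captions from a distribution q concentrated on the caption-ratings dataset rather than from the model's own policy:

    import torch

    def off_policy_pg_loss(log_p_caption, log_q_caption, rating, baseline=0.0):
        """Hypothetical off-policy policy-gradient loss for one rated caption.

        log_p_caption: log-probability of the rated caption under the current
            captioning model (requires grad).
        log_q_caption: log-probability of that caption under the sampling
            distribution q focused on the caption-ratings dataset.
        rating: human rating of the caption, used as the reward.
        """
        # Importance weight p_theta(c|image) / q(c|image), detached so the
        # gradient flows only through the model's log-probability term.
        iw = torch.exp(log_p_caption - log_q_caption).detach()
        # Negate because optimizers minimize; (rating - baseline) centers
        # the reward to reduce gradient variance.
        return -(iw * (rating - baseline) * log_p_caption)

    # Usage with placeholder log-probabilities computed elsewhere:
    log_p = torch.tensor(-12.3, requires_grad=True)
    log_q = torch.tensor(-10.1)
    loss = off_policy_pg_loss(log_p, log_q, rating=0.8)
    loss.backward()

Under these assumptions, minimizing this loss increases the model's probability of captions that human raters scored highly, while the importance weight accounts for the mismatch between the fixed rated-caption distribution and the model's evolving policy.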