Google Research

Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning

Abstract

We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images than the MS-COCO dataset and represents a wider variety of both images and image caption styles. We achieve this by extracting and filtering image caption annotations from billions of webpages. We also present quantitative evaluations of a number of image captioning models and show that a model architecture based on Inception-ResNet-v2 for image-feature extraction and Transformer for sequence modeling achieves the best performance when trained on the Conceptual Captions dataset.
