Teaching CLIP to Count to Ten

Michal Irani
Roni Paiss
Shiran Zada
Submission to CVPR 2023 (2023)

Abstract

Large vision-language models, such as CLIP, learn robust representations of text and images, facilitating advances in many downstream tasks, including zero-shot classification and text-to-image generation. However, these models have several well-documented limitations. They fail to encapsulate compositional concepts, such as counting objects in an image or the relations between objects.
To the best of our knowledge, this work is the first to extend CLIP to handle object counting. We introduce a simple yet effective method to improve the quantitative understanding of vision-language models, while maintaining their overall performance on common benchmarks.
Our method automatically augments image captions to create hard negative samples that differ from the original captions only in the number of objects. For example, an image of three dogs can be contrasted with the negative caption "Six dogs playing in the yard". A dedicated counting loss encourages the model to discriminate between the correct caption and its negative variant.
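
To make the augmentation concrete, a minimal sketch of count-swapped negative captions and a paired contrastive loss follows; the number-word list, the regex-based swap, and the exact form of the loss are illustrative assumptions, not the paper's implementation.

import random
import re

import torch
import torch.nn.functional as F

# Number words that may appear in captions; swapping one for another
# yields a hard negative that differs only in the stated count.
NUMBER_WORDS = ["two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]


def make_counting_negative(caption, rng=random):
    """Return a copy of `caption` with its number word replaced by a
    different one, or None if the caption contains no number word."""
    for word in NUMBER_WORDS:
        if re.search(rf"\b{word}\b", caption, flags=re.IGNORECASE):
            swap = rng.choice([w for w in NUMBER_WORDS if w != word])
            return re.sub(rf"\b{word}\b", swap, caption, count=1, flags=re.IGNORECASE)
    return None


def counting_loss(image_emb, pos_text_emb, neg_text_emb, temperature=0.07):
    """Cross-entropy over (correct caption, count-swapped caption) per image.
    Illustrative form only; it would be added to the standard CLIP loss."""
    img = F.normalize(image_emb, dim=-1)
    pos = (img * F.normalize(pos_text_emb, dim=-1)).sum(-1) / temperature
    neg = (img * F.normalize(neg_text_emb, dim=-1)).sum(-1) / temperature
    logits = torch.stack([pos, neg], dim=-1)                     # (batch, 2)
    targets = torch.zeros(logits.shape[0], dtype=torch.long,
                          device=logits.device)                  # index 0 = correct caption
    return F.cross_entropy(logits, targets)

For instance, make_counting_negative("Three dogs playing in the yard") might return "six dogs playing in the yard", which the loss then treats as the negative caption for that image.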
We introduce CountBench, a new benchmark for evaluating a model's understanding of object counting, and demonstrate significant improvement over baseline models on this task. Furthermore, we leverage our improved CLIP representations for image generation, and show that our model can produce specific counts of objects more reliably than existing ones.
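
As an illustration of how a benchmark of this kind can probe counting, the snippet below scores an image against the same caption template filled with each candidate number word and picks the best match, using the Hugging Face CLIP interface; the checkpoint name and caption template are placeholders, and this is not the paper's evaluation code or CountBench's actual schema.

import torch
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint; the counting-aware model described above would be used instead.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

NUMBER_WORDS = ("two", "three", "four", "five", "six", "seven", "eight", "nine", "ten")


def predict_count(image, caption_template, number_words=NUMBER_WORDS):
    """Return the number word whose caption the model scores highest,
    e.g. caption_template = "{} dogs playing in the yard"."""
    captions = [caption_template.format(w) for w in number_words]
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # shape (1, num_captions)
    return number_words[logits.argmax(dim=-1).item()]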
