CountQA: How Well Do MLLMs Count in the Wild?

Jayant Tamarapalli
Rynaa Grover
Nilay Pande
Sahiti Yerramilli
arXiv preprint arXiv:2508.06585 (2025)

Abstract

While Multimodal Large Language Models (MLLMs) display remarkable fluency in describing visual scenes, their ability to perform the fundamental task of object counting remains poorly understood. This paper addresses that gap by introducing CountQA, a challenging new benchmark of over 1,500 question-answer pairs centered on images of everyday, real-world objects, often in cluttered and occluded arrangements. Our evaluation of 15 prominent MLLMs on CountQA systematically investigates this weakness and reveals a critical failure of numerical grounding: the models consistently struggle to translate raw visual information into an accurate quantity. By providing a dedicated tool to probe this foundational weakness, CountQA paves the way for more robust MLLMs that are spatially aware and numerically grounded.
