CountQA: How Well Do MLLMs Count in the Wild?

Jayant Tamarapalli; Rynaa Grover; Nilay Pande; Sahiti Yerramilli

arXiv preprint arXiv:2508.06585 (2025)

Abstract

While Multimodal Large Language Models (MLLMs) display a remarkable fluency in describing visual scenes, their ability to perform the fundamental task of object counting remains poorly understood. This paper confronts this issue by introducing CountQA, a challenging new benchmark composed of over 1,500 question-answer pairs centered on images of everyday, real-world objects, often in cluttered and occluded arrangements. Our evaluation of 15 prominent MLLMs on CountQA systematically investigates this weakness, revealing a critical failure of numerical grounding: the models consistently struggle to translate raw visual information into an accurate quantity. By providing a dedicated tool to probe this foundational weakness, CountQA paves the way for the development of more robust and truly capable MLLMs that are spatially aware and numerically grounded.
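As an illustration of what scoring a counting benchmark of this kind might involve, the following is a minimal sketch, not the authors' evaluation code. The abstract does not state which metrics CountQA uses, so exact-match accuracy and mean absolute error are assumptions here, as are the helper names `parse_count` and `score` and the rule of taking the first integer in a free-text answer.

```python
import re


def parse_count(answer: str):
    """Return the first integer in a free-text answer, or None if there is none.

    Hypothetical parsing rule: thousands separators are dropped, so "1,200" -> 1200.
    """
    match = re.search(r"\d+", answer.replace(",", ""))
    return int(match.group()) if match else None


def score(predictions, ground_truth):
    """Score model answers against true counts.

    Assumed metrics (not taken from the paper):
      - exact_match: fraction of questions whose parsed count equals the truth;
        unparseable answers count as wrong.
      - mae: mean absolute error over the answers that could be parsed.
      - unparsed: number of answers with no integer in them.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have equal length")

    parsed = [parse_count(p) for p in predictions]
    correct = sum(1 for p, t in zip(parsed, ground_truth) if p == t)
    errors = [abs(p - t) for p, t in zip(parsed, ground_truth) if p is not None]

    return {
        "exact_match": correct / len(ground_truth) if ground_truth else 0.0,
        "mae": sum(errors) / len(errors) if errors else None,
        "unparsed": sum(1 for p in parsed if p is None),
    }


if __name__ == "__main__":
    # Made-up answers for illustration only; these are not CountQA data.
    answers = ["There are 12 apples.", "7", "I cannot tell"]
    truth = [12, 9, 4]
    print(score(answers, truth))
```

Reporting an error magnitude alongside exact match separates near misses from answers that are far off, which matters for cluttered scenes where the true count is large.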

Research Areas

Machine intelligence
