COCO-Stuff: Thing and Stuff Classes in Context
Abstract
Semantic classes can be either things (objects with a well-defined shape, e.g. car, person) or stuff (amorphous background regions, e.g. grass, sky). While much work on classification and detection focuses on thing classes, less attention has been given to stuff classes. Nonetheless, stuff classes are important, as they allow us to explain key aspects of an image, including (1) the scene type; (2) which thing classes are likely to be present and where (through contextual reasoning); and (3) physical attributes, material types, and geometric properties of the scene. To understand stuff and things in context, we introduce COCO-Stuff, which augments 120,000 images of the COCO dataset with pixel-wise annotations for 91 stuff classes. We introduce an efficient stuff annotation protocol based on superpixels, which leverages the original thing annotations. We quantify the speed versus quality trade-off of our protocol and explore the relation between annotation time and boundary complexity. Furthermore, we use COCO-Stuff to analyze: (a) the importance of stuff and thing classes in terms of their surface cover and how frequently they are mentioned in image captions; (b) the spatial relations between stuff and things, highlighting the rich contextual relations that make our dataset unique; and (c) the performance of a modern semantic segmentation method on stuff and thing classes, and whether stuff is easier to segment than things.
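The abstract only names the superpixel-based protocol; the actual annotation tool is described in the paper body and is not reproduced here. As a rough, hypothetical illustration of why superpixel labeling is fast, the Python sketch below over-segments an image with SLIC (a stand-in choice; the paper's own superpixel method may differ) and labels an entire superpixel with a single annotator action. File name, class ids, and the `annotate` helper are illustrative assumptions.

```python
# Illustrative sketch only, NOT the authors' pipeline: superpixel-based
# labeling lets an annotator assign one class per superpixel instead of
# painting individual pixels.
import numpy as np
from skimage import io
from skimage.segmentation import slic

image = io.imread('example.jpg')  # hypothetical input image
# SLIC over-segmentation: each pixel gets a superpixel id.
superpixels = slic(image, n_segments=1000, compactness=10)

# Pixel-wise label map, initialised to an 'unlabeled' id (0 by assumption).
labels = np.zeros(superpixels.shape, dtype=np.int32)

def annotate(superpixel_id: int, class_id: int) -> None:
    """One annotator click: label every pixel of a superpixel at once."""
    labels[superpixels == superpixel_id] = class_id

annotate(superpixel_id=42, class_id=124)  # e.g. mark one region as a stuff class
```

Because one click covers many pixels, annotation cost scales with the number of superpixels rather than the number of pixels, which is the speed/quality trade-off the abstract refers to.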
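Analysis (a) amounts to counting, per class, what fraction of annotated pixels it covers. A minimal sketch of that computation is below, assuming each annotation is available as a 2D array of class ids (an assumption for illustration; the released COCO-Stuff annotations use their own format).

```python
# Minimal sketch of measuring per-class surface cover from pixel-wise
# label maps. Assumes one 2D integer array of class ids per image.
import numpy as np
from collections import Counter

def surface_cover(label_maps):
    """Return the fraction of all annotated pixels covered by each class id."""
    counts = Counter()
    for labels in label_maps:
        ids, freqs = np.unique(labels, return_counts=True)
        counts.update(dict(zip(ids.tolist(), freqs.tolist())))
    total = sum(counts.values())
    return {cid: n / total for cid, n in counts.items()}
```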