Object class labeling is the task of annotating images with labels on the presence or absence of objects from a given class vocabulary. Simply asking one yes-no question per class, however, has a cost that is linear in the vocabulary size and is thus inefficient for large vocabularies. While modern approaches rely on a hierarchical organization of the vocabulary to reduce annotation time, they are still expensive (several minutes per image for the 200 classes in ILSVRC). Instead, we propose a new interface where classes are annotated via speech. Speaking is fast and allows for direct access to the class name, without searching through a list or hierarchy. As additional advantages, annotators can simultaneously speak and scan the image for objects, the interface can be kept extremely simple, and using it requires less mouse movement. However, a key challenge is to train annotators to only say words from the given class vocabulary. We present a way to tackle this challenge and show that our method yields high-quality annotations at significant speed gains (2.3−14.9x faster than existing methods).