Fast, Accurate Detection of 100,000 Object Classes on a Single Machine

June 27, 2013

Posted by Tom Dean, Google Research

Humans can distinguish among approximately 10,000 relatively high-level visual categories, but we can discriminate among a much larger set of visual stimuli referred to as features. These features might correspond to object parts, animal limbs, architectural details, landmarks, and other visual patterns we don’t have names for, and it is this larger collection of features we use as a basis with which to reconstruct and explain our day-to-day visual experience. Such features provide the components for more complicated visual stimuli and establish a context essential for us to resolve ambiguous scenes.

Contrary to current practice in computer vision, the explanatory context required to resolve a visual detail may not be entirely local. A flash of red bobbing along the ground might be a child’s toy in the context of a playground or a rooster in the context of a farmyard. It would be useful to have a large number of feature detectors capable of signaling the presence of such features, including detectors for sandboxes, swings, slides, cows, chickens, sheep and farm machinery necessary to establish the context for distinguishing between these two possibilities.

This year’s winner of the CVPR Best Paper Award, co-authored by Googlers Tom Dean, Mark Ruzon, Mark Segal, Jonathon Shlens, Sudheendra Vijayanarasimhan and Jay Yagnik, describes technology that will enable computer vision systems to extract the sort of semantically rich contextual information required to recognize visual categories, even when a close examination of the pixels spanning the object in question is not sufficient on its own for identification. Specifically, we consider a basic operation in computer vision: determining, for each location in an image, the degree to which a particular feature is likely to be present at that location.
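As a rough illustration of that basic operation, the sketch below scores a small template against every valid location of a tiny image by brute force. The function name `response_map`, the toy image, and the template are assumptions made for this example, and the code computes the unflipped (cross-correlation) form of the operation; it is not taken from the paper.

```python
def response_map(image, template):
    """Score how strongly `template` matches at each valid image location.

    Both arguments are lists of rows of numbers. The score at (r, c) is the
    dot product of the template with the image window whose top-left corner
    is at (r, c).
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    scores = []
    for r in range(ih - th + 1):
        row = []
        for c in range(iw - tw + 1):
            # Inner loop: one multiplication and one addition per template cell.
            total = 0
            for dr in range(th):
                for dc in range(tw):
                    total += image[r + dr][c + dc] * template[dr][dc]
            row.append(total)
        scores.append(row)
    return scores


# Toy data (hypothetical): a bright 2x2 square in the middle of a dark image.
image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
template = [[1, 1], [1, 1]]
scores = response_map(image, template)
```

The highest score lands where the template lines up exactly with the bright square. The cost grows with image size times template size times the number of templates, which is why evaluating very many features this way becomes expensive.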

This so-called convolution operator is one of the key operations used in computer vision and, more broadly, in all of signal processing. Unfortunately, it is computationally expensive, and hence researchers use it sparingly or employ exotic SIMD hardware like GPUs and FPGAs to mitigate the computational cost. We turn things on their head by showing how one can use fast table lookup, a method called hashing, to trade space for time: the computationally expensive inner loop of the convolution operator, a sequence of multiplications and additions that would otherwise be repeated across millions of convolutions, is replaced with a single table lookup.
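The following is a minimal sketch of the general idea of replacing per-filter arithmetic with a table lookup. It is not the implementation from the paper. The rank-based hash function, the hand-picked index groups in `GROUPS`, the filter names, and the helper functions are all hypothetical choices made for illustration. Each filter is hashed once into a table, and at detection time an image window is hashed with the same function and the matching filters are fetched in one lookup.

```python
from collections import defaultdict

# Hypothetical, hand-picked index groups. A real system would likely sample
# many such groups at random and use much longer vectors.
GROUPS = [(0, 1, 2), (3, 4, 5), (1, 3, 5)]


def hash_key(vector):
    """Rank-based key: for each group, record which position holds the max."""
    return tuple(
        max(range(len(g)), key=lambda i: vector[g[i]]) for g in GROUPS
    )


def build_table(filters):
    """Hash every filter once, ahead of time, into key -> list of names."""
    table = defaultdict(list)
    for name, weights in filters.items():
        table[hash_key(weights)].append(name)
    return table


def candidates(table, window):
    """One hash plus one lookup, independent of how many filters exist."""
    return table.get(hash_key(window), [])


# Toy filters (assumed for this example), each a flattened 6-value template.
filters = {
    "edge_left": [9, 1, 0, 8, 2, 1],
    "edge_right": [0, 1, 9, 1, 2, 8],
    "center": [1, 9, 0, 2, 8, 1],
}
table = build_table(filters)

window = [7, 2, 1, 6, 3, 0]  # brightest on the left in both halves
matches = candidates(table, window)
```

Here the window shares its rank pattern with the `edge_left` filter, so the lookup returns that filter without computing a dot product against any of the three. The table costs extra memory, and similar vectors are only likely, not guaranteed, to collide, so a practical system would combine many such tables and verify the candidates.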

We demonstrate the advantages of our approach by scaling object detection from the current state of the art, involving several hundred or at most a few thousand object categories, to 100,000 categories, requiring what would amount to more than a million convolutions. Moreover, our demonstration was carried out on a single commodity computer and required only a few seconds for each image. The basic technology is used in several pieces of Google infrastructure and can be applied to problems outside of computer vision, such as auditory signal processing.

On Wednesday, June 26, the Google engineers responsible for the research were awarded Best Paper at a ceremony at the IEEE Conference on Computer Vision and Pattern Recognition, held in Portland, Oregon. The full paper can be found here.