Jump to Content
Tom Duerig

Tom Duerig

Tom Duerig is a Software engineer and manager within Perception's Media Understanding group in Google Research. Before joining Perception, he worked on Image Search both on image content understanding for ranking and Reverse Image Search. From 2007 to 2009 he was a full-stack engineer on Google's Custom Search Engines. He received a master's degree from UC San Diego where he studied high performance computing and computer vision and a Bachelor's degree in Computer Science from UC Santa Cruz. He is passionate about Computer Vision, creative sources of machine learning supervision and high performance large-scale systems.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Preview abstract Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations also set new state-of-the-art results on Flickr30K and MSCOCO benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries. View details
    Preview abstract We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows to share and adapt the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide 15x more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide in-depth comprehensive statistics about the dataset, we validate the quality of the annotations, we study how the performance of several modern models evolves with increasing amounts of training data, and we demonstrate two applications made possible by having unified annotations of multiple types coexisting in the same images. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection. View details
    Graph-RISE: Graph-Regularized Image Semantic Embedding
    Aleksei Timofeev
    Futang Peng
    Krishnamurthy Viswanathan
    Lucy Gao
    Sujith Ravi
    Yi-ting Chen
    Zhen Li
    The 12th International Conference on Web Search and Data Mining (2020) (to appear)
    Preview abstract Learning image representation to capture instance-based semantics has been a challenging and important task for enabling many applications such as image search and clustering. In this paper, we explore the limits of image embedding learning at unprecedented scale and granularity. We present Graph-RISE, an image embedding that captures very fine-grained, instance-level semantics. Graph-RISE is learned via a large-scale, neural graph learning framework that leverages graph structure to regularize the training of deep neural networks. To the best of our knowledge, this is the first work that can capture instance-level image semantics at million—O(40M)—scale. Experimental results show that Graph-RISE outperforms state-of-the-art image embedding algorithms on several evaluation tasks, including image classification and triplet ranking. We also provide case studies to demonstrate that, qualitatively, image retrieval based on Graph-RISE well captures the semantics and differentiates nuances at instance level. View details
    Preview abstract Most deep architectures for image classification – even those that are trained to classify a large number of diverse categories – learn shared image representations with a single combined model. Intuitively, however, categories that are more visually similar should share more information than those that are very different. While hierarchical deep networks address this problem by learning separate features for subsets of related categories, current implementations require simplified models using fixed architectures specified with heuristic clustering methods. Instead, we propose Blockout, a method for regularization and model selection that simultaneously learns both the model architecture and parameters jointly with end-to-end training. Inspired by dropout, our approach gives a novel parametrization of hierarchical architectures that allows for structure learning using simple back-propagation. To demonstrate the utility of our approach, we evaluate Blockout on the CIFAR and ImageNet datasets demonstrating improved classification accuracy, better regularization performance, faster training, and a clear separation of nodes into hierarchical structures. View details
    The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition
    Andrew Howard
    Alexander Toshev
    James Philbin
    Li Fei-Fei
    Computer Vision and Pattern Recognition (2016)
    Preview abstract Current approaches for fine-grained recognition do the following: First, recruit experts to annotate a dataset of images, optionally also collecting more structured data in the form of part annotations and bounding boxes. Second, train a model utilizing this data. Toward the goal of solving fine-grained recognition, we introduce an alternative approach, leveraging free, noisy data from the web and simple, generic methods of recognition. This approach has benefits in both performance and scalability. We demonstrate its efficacy on four fine-grained datasets, greatly exceeding existing state of the art without the manual collection of even a single label, and furthermore show first results at scaling to more than 10,000 fine-grained categories. Quantitatively, we achieve top-1 accuracies of 92.3% on CUB-200-2011, 85.4% on Birdsnap, 93.4% on FGVC-Aircraft, and 80.8% on Stanford Dogs without using their annotated training sets. We compare our approach to an active learning approach for expanding fine-grained datasets. View details
    No Results Found