Stefan Popov

I work with Vittorio Ferrari on 3D reconstruction from a single image, 3D reconstruction from video, and human-machine collaboration for large-scale image and video annotation. I received my PhD in 2012 from the University of Saarland under the supervision of Prof. Dr.-Ing. Philipp Slusallek, working on real-time ray tracing and global illumination for commodity hardware. I did a postdoc (2012-2014) at INRIA Sophia Antipolis (France) under the supervision of George Drettakis, where I worked on photo-realistic rendering in cooperation with Fredo Durand (MIT) and Ravi Ramamoorthi (UC Berkeley). I joined Google in 2014 and worked on Google's search engine until 2016.

Authored Publications
    NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations
    Varun Jampani
    Andreas Engelhardt
    Arjun Karpur
    Karen Truong
    Kyle Sargent
    Ricardo Martin-Brualla
    Kaushal Patel
    Daniel Vlasic
    Vittorio Ferrari
    Ce Liu
    Neural Information Processing Systems (NeurIPS)(2023)
    Recent advances in neural reconstruction enable high-quality 3D object reconstruction from casually captured image collections. Current techniques mostly analyze their progress on relatively simple image collections where SfM techniques can provide ground-truth (GT) camera poses. We note that SfM techniques tend to fail on in-the-wild image collections such as image search results with varying backgrounds and illuminations. To enable systematic research progress on 3D reconstruction from casual image captures, we propose a new dataset of image collections called 'NAVI' consisting of category-agnostic image collections of objects with high-quality 3D scans along with per-image 2D-3D alignments providing near-perfect GT camera parameters. These 2D-3D alignments allow us to extract derivative annotations such as dense pixel correspondences, depth and segmentation maps. We demonstrate the use of NAVI image collections on different problem settings and show that NAVI enables more thorough evaluations that were not possible with existing datasets. We believe NAVI is beneficial for systematic research progress on 3D reconstruction and correspondence estimation.
    Vid2CAD: CAD Model Alignment using Multi-View Constraints from Videos
    Matthias Niessner
    Vittorio Ferrari
    Transactions on Pattern Analysis and Machine Intelligence (TPAMI)(2022)
    We address the task of aligning CAD models to a video sequence of a complex scene containing multiple objects. Our method can process arbitrary videos and fully automatically recover the 9 DoF pose for each object appearing in it, thus aligning them in a common 3D coordinate frame. The core idea of our method is to integrate neural network predictions from individual frames with a temporally global, multi-view constraint optimization formulation. This integration process resolves the scale and depth ambiguities in the per-frame predictions, and generally improves the estimate of all pose parameters. By leveraging multi-view constraints, our method also resolves occlusions and handles objects that are out of view in individual frames, thus reconstructing all objects into a single globally consistent CAD representation of the scene. In comparison to the state-of-the-art single-frame method Mask2CAD that we build on, we achieve substantial improvements on the Scan2CAD dataset (from 11.6% to 30.7% class average accuracy).
    We propose a transformer-based neural network architecture for multi-object 3D reconstruction from RGB videos. It relies on two alternative ways to represent its knowledge: as a global 3D grid of features and an array of view-specific 2D grids. We progressively exchange information between the two with a dedicated bidirectional attention mechanism. We exploit knowledge about the image formation process to significantly sparsify the attention weight matrix, making our architecture feasible on current hardware, both in terms of memory and computation. We attach a DETR-style head [9] on top of the 3D feature grid in order to detect the objects in the scene and to predict their 3D pose and 3D shape. Compared to previous methods, our architecture is single stage, end-to-end trainable, and it can reason holistically about a scene from multiple video frames without needing a brittle tracking step. We evaluate our method on the challenging Scan2CAD dataset [3], where we outperform (1) recent state-of-the-art methods [38,33] for 3D object pose estimation from RGB videos; and (2) a strong alternative method combining Multi-view Stereo [16] with RGB-D CAD alignment [4]. We plan to release our source code.
    CoReNet: Coherent 3D scene reconstruction from a single RGB image
    Pablo Bauszat
    Vittorio Ferrari
    The European Conference on Computer Vision (ECCV)(2020)
    Advances in deep learning techniques have allowed recent work to reconstruct the shape of a single object given only one RGB image as input. Building on common encoder-decoder architectures for this task, we propose three extensions: (1) ray-traced skip connections that propagate local 2D information to the output 3D volume in a physically correct manner; (2) a hybrid 3D volume representation that enables building translation equivariant models, while at the same time encoding fine object details without an excessive memory footprint; (3) a reconstruction loss tailored to capture overall object geometry. Furthermore, we adapt our model to address the harder task of reconstructing multiple objects from a single image. We reconstruct all objects jointly in one pass, producing a coherent reconstruction, where all objects live in a single consistent 3D coordinate frame relative to the camera and they do not intersect in 3D space. We also handle occlusions and resolve them by hallucinating the missing object parts in the 3D volume. We validate the impact of our contributions experimentally both on synthetic data from ShapeNet as well as real images from Pix3D. Our method outperforms the state-of-the-art single-object methods on both datasets. Finally, we evaluate performance quantitatively on multiple object reconstruction with synthetic scenes assembled from ShapeNet objects.
    We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows sharing and adapting the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide 15x more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide in-depth comprehensive statistics about the dataset, we validate the quality of the annotations, we study how the performance of several modern models evolves with increasing amounts of training data, and we demonstrate two applications made possible by having unified annotations of multiple types coexisting in the same images. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection.
    Manually annotating object segmentation masks is very time consuming. Interactive object segmentation methods offer a more efficient alternative where a human annotator and a machine segmentation model collaborate. In this paper we make several contributions to interactive segmentation: (1) we systematically explore in simulation the design space of deep interactive segmentation models and report new insights and caveats; (2) we execute a large-scale annotation campaign with real human annotators, producing masks for 2.5M instances on the OpenImages dataset. We have released this data publicly, forming (at the time of release) the largest existing dataset for instance segmentation. Moreover, by re-annotating part of the COCO dataset, we show that we can produce instance masks 3 times faster than traditional polygon drawing tools while also providing better quality. (3) We present a technique for automatically estimating the quality of the produced masks which exploits indirect signals from the annotation process.
    We propose to revisit knowledge transfer for training object detectors on target classes from weakly supervised training images, helped by a set of source classes with bounding-box annotations. We present a unified knowledge transfer framework based on training a single neural network multi-class object detector over all source classes, organized in a semantic hierarchy. This generates proposals with scores at multiple levels in the hierarchy, which we use to explore knowledge transfer over a broad range of generality, ranging from class-specific (bicycle to motorbike) to class-generic (objectness to any class). Experiments on the 200 object classes in the ILSVRC 2013 detection dataset show that our technique (1) leads to much better performance on the target classes (70.3% CorLoc, 36.9% mAP) than a weakly supervised baseline which uses manually engineered objectness [10] (50.5% CorLoc, 25.4% mAP); (2) delivers target object detectors reaching 80% of the mAP of their fully supervised counterparts; (3) outperforms the best reported transfer learning results [17, 42] on this dataset (+41% CorLoc, +3% mAP). Moreover, we also carry out several across-dataset knowledge transfer experiments [25, 22, 32] and find that (4) our technique outperforms the weakly supervised baseline in all dataset pairs by 1.5× to 1.9×, establishing its general applicability.