Francis Engelmann

Authored Publications
    Semantic annotations are indispensable for training or evaluating perception models, yet very costly to acquire. This work introduces a fully automated 2D/3D labeling framework that, without any human intervention, can generate labels for RGB-D scans at a level of accuracy equal to (or better than) that of comparable manually annotated datasets such as ScanNet. Our approach is based on an ensemble of state-of-the-art segmentation models and 3D lifting through neural rendering. We demonstrate the effectiveness of our LabelMaker pipeline by generating significantly better labels for the ScanNet datasets and automatically labeling the previously unlabeled ARKitScenes dataset. Code and models are available at https://labelmaker.org/
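The abstract above does not say how the ensemble's outputs are combined. As a minimal sketch only, assuming every model's prediction has already been mapped into one shared label space, a per-pixel majority vote could look like the following. The function name `majority_vote` and the parameters `ignore_label` and `min_votes` are hypothetical and are not taken from the LabelMaker code.

```python
from collections import Counter


def majority_vote(predictions, ignore_label=-1, min_votes=2):
    """Fuse per-pixel labels from several models by majority vote.

    predictions: list of equally long label lists, one list per model.
    A pixel is set to ignore_label when no label reaches min_votes.
    This is an illustrative assumption, not the LabelMaker consensus rule.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")

    fused = []
    for votes in zip(*predictions):
        # Models may abstain on a pixel by emitting ignore_label.
        counts = Counter(v for v in votes if v != ignore_label)
        if not counts:
            fused.append(ignore_label)
            continue
        # On a tie, Counter returns the label that was seen first.
        label, count = counts.most_common(1)[0]
        fused.append(label if count >= min_votes else ignore_label)
    return fused


# Three hypothetical models, three pixels each.
fused_labels = majority_vote([[1, 2, 3], [1, 2, 4], [0, 2, 5]])
```

Pixels on which the models disagree completely are left unlabeled rather than guessed, which is one plausible way to trade coverage for label quality.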
    Existing 3D scene understanding methods are heavily focused on 3D semantic and instance segmentation. However, identifying objects and their parts only constitutes an intermediate step towards a more fine-grained goal, which is effectively interacting with the functional interactive elements (e.g., handles, knobs, buttons) in the scene to accomplish diverse tasks. To this end, we introduce SceneFun3D, a large-scale dataset with more than 14.8k highly accurate interaction annotations for 710 high-resolution real-world 3D indoor scenes. We accompany the annotations with motion parameter information, describing how to interact with these elements, and a diverse set of natural language descriptions of tasks that involve manipulating them in the scene context. To showcase the value of our dataset, we introduce three novel tasks, namely functionality segmentation, task-driven affordance grounding and 3D motion estimation, and adapt existing state-of-the-art methods to tackle them. Our experiments show that solving these tasks in real 3D scenes remains challenging despite recent progress in closed-set and open-set 3D scene understanding methods.
    AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation
    Yuanwen Yue
    Sabarinath Mahadevan
    Jonas Schult
    Bastian Leibe
    Konrad Schindler
    Theodora Kontogianni
    ICLR(2024)
    During interactive segmentation, a model and a user work together to delineate objects of interest in a 3D point cloud. In an iterative process, the model assigns each data point to an object (or the background), while the user corrects errors in the resulting segmentation and feeds them back into the model. The current best practice formulates the problem as binary classification and segments objects one at a time. The model expects the user to provide positive clicks to indicate regions wrongly assigned to the background and negative clicks on regions wrongly assigned to the object. Sequentially visiting objects is wasteful since it disregards synergies between objects: a positive click for a given object can, by definition, serve as a negative click for nearby objects. Moreover, a direct competition between adjacent objects can speed up the identification of their common boundary. We introduce AGILE3D, an efficient, attention-based model that (1) supports simultaneous segmentation of multiple 3D objects, (2) yields more accurate segmentation masks with fewer user clicks, and (3) offers faster inference. Our core idea is to encode user clicks as spatial-temporal queries and enable explicit interactions between click queries as well as between them and the 3D scene through a click attention module. Every time new clicks are added, we only need to run a lightweight decoder that produces updated segmentation masks. In experiments with four different 3D point cloud datasets, AGILE3D sets a new state-of-the-art. Moreover, we also verify its practicality in real-world setups with real user studies. Project page: https://ywyue.github.io/AGILE3D.
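The observation that a positive click for one object doubles as a negative click for the others can be made concrete with a small sketch. It only illustrates the bookkeeping, not the attention model itself; the click format (point index, object id), the convention that id 0 means background, and the function name `to_binary_clicks` are all assumptions made for this example.

```python
def to_binary_clicks(clicks, object_ids):
    """Derive per-object positive/negative clicks from multi-object clicks.

    clicks: list of (point_index, object_id); object_id 0 is assumed
            to mean background.
    object_ids: ids of the objects being segmented (background excluded).
    """
    per_object = {}
    for obj in object_ids:
        per_object[obj] = {
            # Clicks placed on this object are its positive clicks.
            "positive": [p for p, o in clicks if o == obj],
            # Clicks on any other object or on the background act as
            # negative clicks for this object at no extra user cost.
            "negative": [p for p, o in clicks if o != obj],
        }
    return per_object


# Three clicks: one on object 1, one on object 2, one on background.
shared = to_binary_clicks([(10, 1), (20, 2), (30, 0)], object_ids=[1, 2])
```

With three user clicks, each of the two objects ends up with one positive and two negative clicks, whereas a one-object-at-a-time workflow would need those negatives to be clicked separately.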
    We introduce the task of open-vocabulary 3D instance segmentation. Traditional approaches for 3D instance segmentation largely rely on existing 3D annotated datasets, which are restricted to a closed set of objects. This is an important limitation for real-life applications in which an autonomous agent might need to perform tasks guided by novel, open-vocabulary queries related to objects from a wider range of categories. Recently, open-vocabulary 3D scene understanding methods have emerged to address this problem by learning queryable features for each point in the scene. While such a representation can be directly employed to perform semantic segmentation, existing methods have no notion of object instances. In this work, we address the open-vocabulary 3D instance segmentation problem, and propose OpenMask3D, which is a zero-shot approach for open-vocabulary 3D instance segmentation. Guided by predicted class-agnostic 3D instance masks, our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings. We conduct experiments and ablation studies on the ScanNet200 dataset to evaluate the performance of OpenMask3D, and provide insights about the task of open-vocabulary 3D instance segmentation. We show that our approach outperforms other open-vocabulary counterparts, particularly on the long-tail distribution.
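A rough sketch of the per-mask aggregation and querying idea follows. It assumes that image embeddings for each mask have already been extracted from several views, fuses them with a plain average, and ranks masks by cosine similarity to a text-query embedding. The averaging rule, the toy 2D vectors and the names `mean_embedding` and `rank_masks` are illustrative assumptions; the abstract does not specify the fusion details.

```python
import math


def mean_embedding(view_embeddings):
    """Average the embeddings of one mask over all views (assumed fusion)."""
    dim = len(view_embeddings[0])
    count = len(view_embeddings)
    return [sum(e[i] for e in view_embeddings) / count for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_masks(mask_views, query_embedding):
    """Return (mask_id, score) pairs, best match to the query first."""
    scores = {
        mask_id: cosine_similarity(mean_embedding(views), query_embedding)
        for mask_id, views in mask_views.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy 2D embeddings standing in for real image and text embeddings.
mask_views = {
    "mask_a": [[1.0, 0.0], [0.8, 0.2]],
    "mask_b": [[0.0, 1.0], [0.2, 0.8]],
}
ranking = rank_masks(mask_views, query_embedding=[1.0, 0.0])
```

Because each mask carries one fused feature vector, any text query can be scored against every instance without retraining, which is what makes the setup zero-shot.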
    We propose a method to detect and reconstruct multiple 3D objects from a single 2D image. The method is based on a key-point detector that localizes object centers in the image and then predicts all necessary properties for multi-object reconstruction: oriented 3D bounding boxes, 3D shapes, and semantic class labels. Our method formulates 3D shape reconstruction as a classification problem, i.e. selecting among exemplar CAD models from the training set. This makes it agnostic to shape representations, and enables the reconstruction of realistic and visually pleasing shapes (unlike e.g. voxel-based methods). At the same time, we also rely on point clouds and voxel representations derived from the CAD models to formulate the loss functions. In particular, a collision loss penalizes intersecting objects, further increasing the realism of the reconstructed scenes. The method is a single-stage approach and is therefore orders of magnitude faster than two-stage approaches; it is also fully differentiable and end-to-end trainable.
    3D-MPA: Multi Proposal Aggregation for 3D Semantic Instance Segmentation
    Bastian Leibe
    Matthias Niessner
    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)(2020)
    We present 3D-MPA, a method for instance segmentation on 3D point clouds. Given an input point cloud, we propose an object-centric approach where each point votes for its object center. We sample object proposals from the predicted object centers. Then, we learn proposal features from grouped point features that voted for the same object center. A graph convolutional network introduces inter-proposal relations, providing higher-level feature learning in addition to the lower-level point features. Each proposal comprises a semantic label, a set of associated points over which we define a foreground-background mask, an objectness score and aggregation features. Previous works usually perform non-maximum suppression (NMS) over proposals to obtain the final object detections or semantic instances. However, NMS can discard potentially correct predictions. Instead, our approach keeps all proposals and groups them together based on the learned aggregation features. We show that grouping proposals improves over NMS and outperforms previous state-of-the-art methods on the tasks of 3D object detection and semantic instance segmentation on the ScanNetV2 benchmark and the S3DIS dataset.
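To illustrate what keeping all proposals and grouping them can mean in practice, the sketch below merges proposals whose aggregation features lie close together and unions their point sets, so no proposal is discarded as it would be under NMS. The distance threshold, the connected-components grouping via union-find and the name `group_proposals` are stand-in assumptions; the abstract does not state which clustering procedure the method uses.

```python
import math


def group_proposals(proposals, max_distance):
    """Group proposals by feature proximity and merge their points.

    proposals: list of (feature_vector, set_of_point_indices).
    Proposals whose features are within max_distance of each other
    (directly or through a chain) end up in the same instance.
    """
    parent = list(range(len(proposals)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(proposals)):
        for j in range(i + 1, len(proposals)):
            if math.dist(proposals[i][0], proposals[j][0]) <= max_distance:
                parent[find(i)] = find(j)

    instances = {}
    for i, (_, points) in enumerate(proposals):
        # Every proposal contributes its points; none is suppressed.
        instances.setdefault(find(i), set()).update(points)
    return list(instances.values())


# Two nearby proposals and one distant proposal (toy 2D features).
toy_proposals = [
    ([0.0, 0.0], {1, 2}),
    ([0.1, 0.0], {2, 3}),
    ([5.0, 5.0], {7, 8}),
]
grouped = group_proposals(toy_proposals, max_distance=0.5)
```

In the toy input, NMS would keep only one of the two overlapping proposals and lose either point 1 or point 3, while grouping recovers the union of both.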