OpenMask3D: Open-Vocabulary 3D Instance Segmentation
Abstract
We introduce the task of open-vocabulary 3D instance segmentation.
Traditional approaches for 3D instance segmentation largely rely on existing 3D annotated datasets, which are restricted to a closed set of object categories.
This is an important limitation for real-life applications in which an autonomous agent might need to perform tasks guided by novel, open-vocabulary queries related to objects from a wider range of categories.
Recently, open-vocabulary 3D scene understanding methods have emerged to address this problem by learning queryable features for each point in the scene. While such a representation can be directly employed to perform semantic segmentation, existing methods have no notion of object instances.
In this work, we address this limitation and propose OpenMask3D, a zero-shot approach for open-vocabulary 3D instance segmentation.
Guided by predicted class-agnostic 3D instance masks, our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings.
We conduct experiments and ablation studies on the ScanNet200 dataset to evaluate the performance of OpenMask3D, and provide insights about the task of open-vocabulary 3D instance segmentation. We show that our approach outperforms other open-vocabulary counterparts, particularly on the long-tail distribution.
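
The per-mask feature aggregation described above can be illustrated with a minimal sketch. It assumes that, for each predicted 3D instance mask, image embeddings from several views have already been computed by a CLIP-style encoder; the embeddings here are toy 3-dimensional vectors rather than real CLIP outputs. The function names (fuse_mask_features, rank_masks), the optional per-view weights, and the use of weighted mean pooling are illustrative assumptions, since the abstract does not specify how the views are fused.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps guards against zero norm)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def fuse_mask_features(view_embeddings, view_weights=None):
    """Fuse the per-view embeddings of ONE mask into a single unit vector.

    Assumption: weighted mean pooling over views. The weights are a
    hypothetical stand-in for, e.g., how visible the mask is in each view.
    """
    emb = l2_normalize(np.asarray(view_embeddings, dtype=np.float64))
    if view_weights is None:
        view_weights = np.ones(len(emb))
    w = np.asarray(view_weights, dtype=np.float64)
    fused = (w[:, None] * emb).sum(axis=0) / w.sum()
    return l2_normalize(fused)


def rank_masks(mask_features, query_embedding):
    """Rank masks by cosine similarity to a (text) query embedding."""
    q = l2_normalize(np.asarray(query_embedding, dtype=np.float64))
    scores = np.asarray(mask_features) @ q
    return np.argsort(-scores), scores


# Toy data: two masks, each seen in two views (hypothetical embeddings).
views_mask0 = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
views_mask1 = [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]

mask_features = np.stack([
    fuse_mask_features(views_mask0),
    fuse_mask_features(views_mask1, view_weights=[2.0, 1.0]),
])

# A query embedding that lies close to the first mask's views.
order, scores = rank_masks(mask_features, [1.0, 0.0, 0.0])
print(order, scores)
```

Because each mask carries a single fused feature, a query retrieves whole object instances rather than individual points, which is the distinction the abstract draws against per-point open-vocabulary features.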