Holistic Video Scene Understanding with ViP-DeepLab

April 27, 2021

Posted by Siyuan Qiao, Student Researcher and Liang-Chieh Chen, Research Scientist, Google Research

People are able to retrieve the visual information about 3D environments from a picture quite easily — we can identify objects, determine instance sizes, and reconstruct 3D scene layout, all using the limited signals contained in 2D images. This ability is commonly known as the inverse projection problem, which refers to reconstructing the ambiguous mapping from the retinal images to the sources of retinal stimulation. Real-world computer vision applications, such as autonomous driving, heavily rely on these capabilities to localize and identify 3D objects, which require vision models to infer the spatial location, semantic class, and instance label for each 3D point projected to the 2D images. The ability to reconstruct the 3D world from images can be decomposed into two disjoint computer vision tasks: monocular depth estimation (predicting depth from a single image) and video panoptic segmentation (the unification of instance segmentation and semantic segmentation, in the video domain). However, research has generally considered each task separately. Tackling these tasks jointly with a unified computer vision model could result in easier deployment and greater efficiency by sharing computation among multiple tasks.

Driven by the potential value of a model that predicts depth and video panoptic segmentation at the same time, we present “ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation”, accepted to CVPR 2021. In this work, we propose a new task, depth-aware video panoptic segmentation, that aims to simultaneously tackle monocular depth estimation and video panoptic segmentation. For the new task, we present two derived datasets accompanied by a new evaluation metric called depth-aware video panoptic quality (DVPQ). This new metric includes the metrics for depth estimation and video panoptic segmentation, requiring a vision model to simultaneously tackle the two sub-tasks. To this end, we extend Panoptic-DeepLab by adding network branches for depth and video predictions to create ViP-DeepLab, a unified model that jointly performs video panoptic segmentation and monocular depth estimation for each pixel on the image plane, and achieves state-of-the-art performance on several academic datasets for the sub-tasks. This video demonstrates the new task and shows the results of ViP-DeepLab.

Depth-aware video panoptic segmentation results obtained by ViP-DeepLab. Top-left: Video frames used as input. Top-right: Video panoptic segmentation results. Bottom-left: Estimated depth. Bottom-right: Reconstructed 3D points. Each object instance has a unique and temporally consistent label, e.g., pedestrain_1, pedestrain_2, etc. Input images are from the Cityscapes dataset.


While Panoptic-DeepLab is able to output semantic segmentation, center prediction, and center regression for a single frame, it lacks the capability of depth estimation and temporally consistent instance ID prediction for multiple frames. However, ViP-DeepLab accomplishes this by performing additional predictions from two consecutive frames as input. The first additional output is depth estimation for the first frame, for which it assigns an estimated depth to each pixel. In addition, ViP-DeepLab also performs center regression for two consecutive frames for only the object centers that appear in the first frame. This process is called center offset prediction, and allows ViP-DeepLab to group all the pixels in the two frames to the same object that appears in the first frame. New instances emerge if they are not grouped to the previously detected instances. This process continues for every two consecutive frames (with one overlapping frame) in a video sequence, stitching panoptic predictions together to form predictions with temporally consistent instance IDs. That is, it stitches together where objects are and how they move in a video scene with time.

Outputs of ViP-DeepLab for video panoptic segmentation. Two consecutive frames are concatenated as input. The semantic segmentation output associates each pixel with its semantic classes, while the instance segmentation outputs identify the pixels from two frames associated with an individual object in the first frame. Input images are from the Cityscapes dataset.
Visualization of stitching video panoptic predictions. ViP-DeepLab propagates IDs based on mask intersection-over-union between region pairs. It is capable of tracking objects with large movements, e.g., the cyclist in the image.

Neural Network Design

Building on top of Panoptic-DeepLab, ViP-DeepLab additionally contains two prediction branches: (1) a depth prediction branch, and (2) a next-frame instance branch. Specifically, the depth prediction head is a simple design that predicts depth regression for every pixel, while the next-frame instance branch predicts the center offsets for the pixels in the second frame with respect to the centers in the first frame.


We have tested ViP-DeepLab on multiple popular benchmarks, including Cityscapes-VPS, KITTI Depth Prediction, and KITTI Multi-Object Tracking and Segmentation (MOTS).

Specifically, ViP-DeepLab achieves state-of-the-art (SOTA) results, significantly outperforming previous methods by 5.1% video panoptic quality (VPQ) on the Cityscapes-VPS test set.

Method VPQAll VPQThings VPQStuff
VPSNet 57.4% 45.8% 64.8%
ViP-DeepLab       62.5% (+5.1%)    50.2% (+4.4%)    70.3% (+5.5%)   

VPQ comparison on Cityscapes-VPS test set.

ViP-DeepLab ranks 1st on the KITTI depth prediction benchmark, improving over previous methods by 0.65 SILog (the smaller the better).

Method SILog   sqErrorRel   absErrorRel   iRMSE
PWA 11.45     2.30    9.05 12.32
ViP-DeepLab       10.80     2.19    8.94 11.77

Monocular depth estimation comparison on KITTI Depth Prediction benchmark. Note for the depth estimation metrics, the smaller the values, the better the performance. While differences may appear small, the top-performing method on this benchmark usually has a gap in SILog smaller than 0.1.

Additionally, ViP-DeepLab was also 1st on KITTI MOTS pedestrians and 3rd on KITTI MOTS cars ranked by the metric sMOTSA, and now is 3rd for both pedestrians and cars ranked by a newer metric HOTA.

Class Method HOTA
Car PointTrack 62.0%
ViP-DeepLab 76.4% (+14.4%)
Pedestrian       PointTrack 54.4%
ViP-DeepLab       64.3% (+9.9%)   

Performance comparison on KITTI Multi-Object Tracking and Segmentation.

Finally, we also present two new datasets for the new task, depth-aware video panoptic segmentation, and test ViP-DeepLab on them. We hope our ViP-DeepLab results on these two new datasets will serve as a strong baseline for the community to compare against. The results are shown below.

Dataset DVPQAll    DVPQThings    DVPQStuff   
Cityscapes-DVPS       55.1% 43.3% 63.6%
SemKITTI-DVPS 45.6% 36.6% 52.2%

ViP-DeepLab performance for the task of depth-aware video panoptic segmentation on two new datasets.


With a simple architecture, ViP-DeepLab achieves state-of-the-art performance on video panoptic segmentation, monocular depth estimation, and multi-object tracking and segmentation. We hope that along with MaX-DeepLab, which proposes an efficient dual-path transformer module that allows for end-to-end image panoptic segmentation, ViP-DeepLab is useful to the community and furthers research into a more holistic understanding of scenes in the real world.


We would like to thank the support and valuable discussions with Yukun Zhu, Hartwig Adam, and Alan Yuille (co-authors of ViP-DeepLab), as well as Maxwell Collins, and the Mobile Vision team.