Future Semantic Segmentation Leveraging 3D Information
Abstract
Predicting the future to anticipate the outcome of events and actions is a critical attribute of autonomous agents. In this work, we address the task of predicting the semantic segmentation of future frames from a stream of monocular video by leveraging the 3D structure of the scene. Our framework is built from learnable sub-modules that predict pixelwise scene semantic labels, depth, and camera ego-motion between adjacent frames. We observe that incorporating 3D structure into the model helps it position objects correctly in the 3D scene, achieving state-of-the-art accuracy in future semantic segmentation.
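To illustrate how per-pixel labels, depth, and ego-motion could combine geometrically, the following is a minimal sketch rather than the method described in this work. It assumes a pinhole camera with a known intrinsic matrix K and a rigid transform T that maps 3D points from the current camera frame to the future camera frame. The function name warp_labels_to_future, its parameters, and the fill value are hypothetical and introduced only for this example. Each pixel is back-projected with its depth, moved by T, and re-projected, with the nearest surface winning when several points land on the same pixel.

```python
import numpy as np


def warp_labels_to_future(labels, depth, K, T, fill=-1):
    """Forward-warp an (H, W) label map into a future view (illustrative sketch).

    labels: (H, W) integer class ids for the current frame.
    depth:  (H, W) per-pixel depth along the camera z-axis.
    K:      (3, 3) pinhole intrinsics (assumed known here).
    T:      (4, 4) rigid transform, current camera frame -> future camera frame
            (an assumed convention for this example).
    fill:   label written where no source pixel lands (disocclusions).
    """
    H, W = labels.shape
    n = H * W

    # Homogeneous pixel coordinates [u, v, 1], shape (3, H*W).
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(n)]).astype(np.float64)

    # Back-project every pixel to a 3D point: X = depth * K^-1 [u, v, 1]^T.
    pts = (np.linalg.inv(K) @ pix) * depth.ravel()

    # Apply the rigid camera motion.
    pts_future = T[:3, :3] @ pts + T[:3, 3:4]
    z = pts_future[2]

    # Keep points in front of the future camera, then project them.
    front = z > 1e-6
    proj = K @ pts_future[:, front]
    u2 = np.rint(proj[0] / proj[2]).astype(np.int64)
    v2 = np.rint(proj[1] / proj[2]).astype(np.int64)
    inside = (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)

    target = (v2 * W + u2)[inside]          # flat index in the future image
    z_kept = z[front][inside]               # depth of each surviving point
    src_labels = labels.ravel()[front][inside]

    # Z-buffer: sort by target pixel, then by depth, and keep the first
    # (nearest) point for every target pixel.
    order = np.lexsort((z_kept, target))
    uniq, first = np.unique(target[order], return_index=True)

    out = np.full(n, fill, dtype=labels.dtype)
    out[uniq] = src_labels[order][first]
    return out.reshape(H, W)
```

In a learned pipeline, labels, depth, and T would come from the corresponding sub-modules instead of being supplied by hand, and the pixels left at the fill value (regions that become visible only in the future frame) would still need to be predicted by some other means.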