Self-Supervised Learning of Structure and Motion from Video

Abstract

We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion into scene and object depth, camera motion, and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, and camera and rigid object motions, converts these into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels, and back-propagates the resulting matching error.
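To make the geometric core of this pipeline concrete, the following is a minimal NumPy sketch: lift pixels to 3D with predicted depth, apply a rigid camera motion, re-project, and read the optical flow off the pixel displacement. The function names, the pinhole model with intrinsics K, and the convention that (R, t) maps frame-1 camera coordinates to frame-2 camera coordinates are our illustrative assumptions, not the paper's implementation (which additionally composes per-object rigid motions).

```python
import numpy as np

def backproject(depth, K):
    """Lift each pixel (u, v) to the 3D point depth * K^{-1} [u, v, 1]^T."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1)  # 3 x N homogeneous pixels
    rays = np.linalg.inv(K) @ pix                           # 3 x N viewing rays
    return rays * depth.reshape(1, -1)                      # scale rays by depth

def flow_from_depth_and_motion(depth, K, R, t):
    """Dense frame-to-frame flow induced by depth and a rigid camera motion."""
    h, w = depth.shape
    X1 = backproject(depth, K)          # points in frame-1 camera coordinates
    X2 = R @ X1 + t[:, None]            # same points in frame-2 coordinates
    p = K @ X2
    p = p[:2] / p[2:]                   # perspective projection back to pixels
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    return np.stack([p[0].reshape(h, w) - u, p[1].reshape(h, w) - v], axis=-1)

# Example: fronto-parallel plane 5 m away, camera motion shifting points by
# 0.1 m along x; the induced flow is fx * tx / z = 100 * 0.1 / 5 = 2 pixels.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
depth = np.full((48, 64), 5.0)
flow = flow_from_depth_and_motion(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
print(flow[24, 32])   # -> [2. 0.]
```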
The model can be trained with various degrees of supervision: 1) completely unsupervised, 2) supervised by ego-motion (camera motion), 3) supervised by depth (e.g., as provided by RGBD sensors), or 4) supervised by ground-truth optical flow.
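In the completely unsupervised setting, the only training signal is how well one frame, warped by the predicted flow, matches the other. Below is a minimal sketch of such a photometric loss for a grayscale pair; bilinear_warp and photometric_loss are hypothetical names, and the paper's actual objective may include regularization terms beyond this.

```python
import numpy as np

def bilinear_warp(image, flow):
    """Backward-warp a grayscale image: sample it at pixel + flow positions."""
    h, w = image.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = np.clip(u + flow[..., 0], 0.0, w - 1.001)
    y = np.clip(v + flow[..., 1], 0.0, h - 1.001)
    x0, y0 = x.astype(int), y.astype(int)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x0 + 1]
    bottom = (1 - wx) * image[y0 + 1, x0] + wx * image[y0 + 1, x0 + 1]
    return (1 - wy) * top + wy * bottom

def photometric_loss(frame1, frame2, flow):
    """Mean absolute error between frame1 and frame2 warped back onto it."""
    return np.abs(frame1 - bilinear_warp(frame2, flow)).mean()

# Example: frame2 is frame1 shifted right by 2 pixels, so a constant flow of
# (2, 0) explains it almost perfectly (mismatch only where the shift wraps).
f1 = np.random.rand(48, 64)
f2 = np.roll(f1, 2, axis=1)
flow = np.zeros((48, 64, 2))
flow[..., 0] = 2.0
print(photometric_loss(f1, f2, flow))   # small; ~0 away from the border
```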
We show that SfM-Net successfully segments the objects in the scene, even though such supervision is never provided. It extracts meaningful depth estimates, or in-fills missing depth from RGBD sensors, and successfully estimates frame-to-frame camera displacements. SfM-Net achieves state-of-the-art optical flow performance. Our work is inspired by the long history of research on geometry-aware motion estimation, Simultaneous Localization and Mapping (SLAM), and Structure from Motion (SfM). SfM-Net is an important first step towards a learning-based approach to such tasks. A major benefit over existing optimization-based approaches is that our method can improve itself by processing more videos and by learning to explicitly model moving objects in dynamic scenes.