Learning the Depths of Moving People by Watching Frozen People

Zhengqi Li
Ce Liu
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

Abstract

We present a method for predicting dense depth in scenarios where both a monocular camera and people in the scene are freely moving. Existing methods for recovering depth of dynamic, non-rigid objects from monocular video impose strong assumptions on the objects' motion and often recover only sparse depth. In this paper, we take a data-driven approach and learn human depth priors from a large corpus of data. Specifically, we use a new source of data comprising thousands of Internet videos in which people imitate mannequins, i.e., they freeze in diverse, natural poses while a hand-held camera tours the scene. We then create training data using modern Multi-View Stereo (MVS) methods, and design a model that can be applied to dynamic scenes at inference time. Our method makes use of motion parallax beyond a single view and shows clear advantages over state-of-the-art monocular depth prediction methods. We demonstrate the applicability of our method on real-world sequences captured by a moving hand-held camera and depicting complex human actions. We show various 3D effects, such as re-focusing, creating a stereoscopic video from a monocular one, and inserting virtual objects into the scene, all produced using our predicted depth maps.
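The 3D effects mentioned above (e.g., synthetic re-focusing) only require a per-pixel depth map. As a rough illustration, not the authors' implementation, the Python sketch below applies a depth-dependent blur to simulate refocusing; the synthetic_refocus function, its parameters, and the blur-level heuristic are illustrative assumptions.

    import numpy as np
    import cv2

    def synthetic_refocus(image, depth, focus_depth, max_blur=15, n_levels=6):
        # image: HxWx3 uint8 frame; depth: HxW float depth map at the same resolution.
        # focus_depth: depth value to keep sharp; max_blur: largest Gaussian kernel radius.
        # Normalized distance from the focal plane selects a blur level per pixel.
        diff = np.abs(depth.astype(np.float32) - focus_depth)
        diff /= diff.max() + 1e-8

        # Pre-compute a small stack of progressively blurred copies of the frame.
        blurred = [image]
        for level in range(1, n_levels):
            k = 2 * int(round(level * max_blur / (n_levels - 1))) + 1  # odd kernel size
            blurred.append(cv2.GaussianBlur(image, (k, k), 0))
        blurred = np.stack(blurred)  # shape (n_levels, H, W, 3)

        # Pick, per pixel, the blur level proportional to the depth difference.
        level_idx = np.clip(np.round(diff * (n_levels - 1)).astype(int), 0, n_levels - 1)
        rows, cols = np.indices(level_idx.shape)
        return blurred[level_idx, rows, cols]

In this simple scheme, pixels whose predicted depth is close to focus_depth stay sharp, while pixels farther from the focal plane receive progressively stronger blur; the paper's other effects (stereoscopic synthesis, virtual object insertion) likewise consume the predicted depth map.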
