Learning the Depths of Moving People by Watching Frozen People
Abstract
We present a method for predicting dense depth in scenarios where both a monocular camera and the people in the scene are freely moving. Existing methods for recovering depth for dynamic, non-rigid objects from monocular video impose strong assumptions on the objects' motion and often recover only sparse depth. In this paper, we take a data-driven approach and learn human depth priors from a large corpus of data. Specifically, we use a new source of data consisting of thousands of Internet videos in which people imitate mannequins, i.e., people freeze in diverse, natural poses while a hand-held camera tours the scene. We then create training data using modern Multi-View Stereo (MVS) methods, and design a model that is applied to dynamic scenes at inference time. Our method makes use of motion parallax cues beyond a single view and shows clear advantages over state-of-the-art monocular depth prediction methods. We demonstrate the applicability of our method on real-world sequences captured by a moving hand-held camera, depicting complex human actions. We show various 3D effects such as re-focusing, creating a stereoscopic video from a monocular one, and inserting virtual objects into the scene, all produced using our predicted depth maps.
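To illustrate why motion parallax carries depth information for the static parts of a scene, the following minimal sketch applies the classic rectified two-view relation Z = f * b / d. It is not the learned model described above. The function name `depth_from_parallax` and its parameters are hypothetical, and the sketch assumes a purely sideways camera translation between the two frames and a point that does not move.

```python
def depth_from_parallax(disparity_px, focal_px, baseline_m):
    """Depth in metres of a static point from its parallax between two views.

    Assumes (for illustration only) a purely sideways camera translation of
    baseline_m metres, a focal length of focal_px pixels, and a static point
    whose image shifts by disparity_px pixels between the two frames.
    """
    if disparity_px <= 0:
        # No measurable parallax: the point is effectively at infinity.
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Nearby points shift more between frames than distant ones.
near = depth_from_parallax(disparity_px=20.0, focal_px=500.0, baseline_m=0.1)
far = depth_from_parallax(disparity_px=5.0, focal_px=500.0, baseline_m=0.1)
```

The relation holds only for static scene content, which is why a moving person violates it and why frozen people filmed by a moving camera can be reconstructed with MVS to provide training depth.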