Jump to Content

Neural Descent for Visual 3D Human Pose and Shape

Andrei Zanfir
Mihai Zanfir
Bill Freeman
Rahul Sukthankar
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021), pp. 14484-14493


We present a deep neural network methodology to reconstruct the 3d pose and shape of people, given image or video inputs. We rely on a recently introduced, expressive full body statistical 3d human model, GHUM, with facial expression and hand detail and aim to learn to reconstruct the model pose and shape states in a self-supervised regime. Central to our methodology, is a learning to learn approach, referred to as HUman Neural Descent (HUND) that avoids both second-order differentiation when training the model parameters, and expensive state gradient descent in order to accurately minimize a semantic differentiable rendering loss at test time. Instead, we rely on novel recurrent stages to update the pose and shape parameters such that not only losses are minimized effectively but the process is regularized in order to ensure progress. The newly introduced architecture is tested extensively, and achieves state-of-the-art results on datasets like H3.6M and 3DPW, as well as in complex imagery collected in-the-wild.