Rohit Kumar Pandey
Rohit is a machine learning researcher and engineer in the augmented perception team at Google. His recent efforts are focused on applying deep learning to style transfer, novel view synthesis and relighting for humans. He has also worked on designing and implementing efficient deep learning solutions that can be deployed on mobile devices. Prior to Google, he graduated from the University at Buffalo, SUNY with a PhD in Computer Science, where his research focused on privacy preserving deep learning and its applications to biometric authentication.
Authored Publications
Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos
Ziqian Bai
Danhang "Danny" Tang
Di Qiu
Abhimitra Meka
Mingsong Dou
Ping Tan
Thabo Beeler
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE
We propose a method to learn a high-quality implicit 3D head avatar from a monocular RGB video captured in the wild. The learnt avatar is driven by a parametric face model to achieve user-controlled facial expressions and head poses. Our hybrid pipeline combines the geometry prior and dynamic tracking of a 3DMM with a neural radiance field to achieve fine-grained control and photorealism. To reduce over-smoothing and improve out-of-model expressions synthesis, we propose to predict local features anchored on the 3DMM geometry. These learnt features are driven by 3DMM deformation and interpolated in 3D space to yield the volumetric radiance at a designated query point. We further show that using a Convolutional Neural Network in the UV space is critical in incorporating spatial context and producing representative local features. Extensive experiments show that we are able to reconstruct high-quality avatars, with more accurate expression-dependent details, good generalization to out-of-training expressions, and quantitatively superior renderings compared to other state-of-the-art approaches.
HumanGPS: Geodesic PreServing Feature for Dense Human Correspondence
Danhang "Danny" Tang
Mingsong Dou
Kaiwen Guo
Cem Keskin
Sofien Bouaziz
Ping Tan
Computer Vision and Pattern Recognition 2021 (2021), pp. 8
In this paper, we address the problem of building dense correspondences between human images under arbitrary camera viewpoints and body poses. Prior art either assumes small motion between frames or relies on local descriptors, which cannot handle large motion or visually ambiguous body parts, e.g. left vs. right hand. In contrast, we propose a deep learning framework that maps each pixel to a feature space, where the feature distances reflect the geodesic distances among pixels as if they were projected onto the surface of a 3D human scan. To this end, we introduce novel loss functions to push features apart according to their geodesic distances on the surface. Without any semantic annotation, the proposed embeddings automatically learn to differentiate visually similar parts and align different subjects into a unified feature space. Extensive experiments show that the learned embeddings can produce accurate correspondences between images with remarkable generalization capabilities in both intra-subject and inter-subject settings.
Neural Light Transport for Relighting and View Synthesis
Xiuming Zhang
Yun-Ta Tsai
Tiancheng Sun
Tianfan Xue
Philip Davidson
Christoph Rhemann
Paul Debevec
Ravi Ramamoorthi
ACM Transactions on Graphics, 40 (2021)
The light transport (LT) of a scene describes how it appears under different lighting and viewing directions, and complete knowledge of a scene's LT enables the synthesis of novel views under arbitrary lighting. In this paper, we focus on image-based LT acquisition, primarily for human bodies within a light stage setup. We propose a semi-parametric approach to learn a neural representation of LT that is embedded in the space of a texture atlas of known geometric properties, and model all non-diffuse and global LT as residuals added to a physically-accurate diffuse base rendering. In particular, we show how to fuse previously seen observations of illuminants and views to synthesize a new image of the same scene under a desired lighting condition from a chosen viewpoint. This strategy allows the network to learn complex material effects (such as subsurface scattering) and global illumination, while guaranteeing the physical correctness of the diffuse LT (such as hard shadows). With this learned LT, one can relight the scene photorealistically with a directional light or an HDRI map, synthesize novel views with view-dependent effects, or do both simultaneously, all in a unified framework using a set of sparse, previously seen observations. Qualitative and quantitative experiments demonstrate that our neural LT (NLT) outperforms state-of-the-art solutions for relighting and view synthesis, without separate treatment for both problems that prior work requires.
Total Relighting: Learning to Relight Portraits for Background Replacement
Christian Haene
Sofien Bouaziz
Christoph Rhemann
Paul Debevec
SIGGRAPH and TOG (2021)
We propose a novel system for portrait relighting and background replacement, which maintains high-frequency boundary details and accurately synthesizes the subject’s appearance as lit by novel illumination, thereby producing realistic composite images for any desired scene. Our technique includes foreground estimation via alpha matting, relighting, and compositing. We demonstrate that each of these stages can be tackled in a sequential pipeline without the use of priors (e.g. known background or known illumination) and with no specialized acquisition techniques, using only a single RGB portrait image and a novel, target HDR lighting environment as inputs. We train our model using relit portraits of subjects captured in a light stage computational illumination system, which records multiple lighting conditions, high quality geometry, and accurate alpha mattes. To perform realistic relighting for compositing, we introduce a novel per-pixel lighting representation in a deep learning framework, which explicitly models the diffuse and the specular components of appearance, producing relit portraits with convincingly rendered non-Lambertian effects like specular highlights. Multiple experiments and comparisons show the effectiveness of the proposed approach when applied to in-the-wild images.
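The final compositing stage mentioned in this abstract can be illustrated with the standard alpha-compositing ("over") equation, C = αF + (1 − α)B. The sketch below is only a minimal illustration of that equation, not the paper's implementation; the function name `composite` and the nested-list image layout are assumptions made for the example.

```python
def composite(foreground, alpha, background):
    """Blend a relit foreground over a new background, pixel by pixel.

    Hypothetical helper: images are rows of (r, g, b) tuples with values in
    [0, 1], and alpha is a matching grid of per-pixel opacities in [0, 1].
    """
    out = []
    for fg_row, a_row, bg_row in zip(foreground, alpha, background):
        out_row = []
        for fg, a, bg in zip(fg_row, a_row, bg_row):
            # Standard "over" operator: C = a * F + (1 - a) * B
            out_row.append(tuple(a * f + (1.0 - a) * b for f, b in zip(fg, bg)))
        out.append(out_row)
    return out
```

A pixel with alpha 1.0 keeps the foreground colour unchanged, while a pixel with alpha 0.5 is an even mix of foreground and background, which is how soft boundary details such as hair strands blend into the new scene.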
Learning Illumination from Diverse Portraits
Wan-Chun Alex Ma
Christoph Rhemann
Jason Dourgarian
Paul Debevec
SIGGRAPH Asia 2020 Technical Communications (2020)
We present a learning-based technique for estimating high dynamic range (HDR), omnidirectional illumination from a single low dynamic range (LDR) portrait image captured under arbitrary indoor or outdoor lighting conditions. We train our model using portrait photos paired with their ground truth illumination. We generate a rich set of such photos by using a light stage to record the reflectance field and alpha matte of 70 diverse subjects in various expressions. We then relight the subjects using image-based relighting with a database of one million HDR lighting environments, compositing them onto paired high-resolution background imagery recorded during the lighting acquisition. We train the lighting estimation model using rendering-based loss functions and add a multi-scale adversarial loss to estimate plausible high frequency lighting detail. We show that our technique outperforms the state-of-the-art technique for portrait-based lighting estimation, and we also show that our method reliably handles the inherent ambiguity between overall lighting strength and surface albedo, recovering a similar scale of illumination for subjects with diverse skin tones. Our method allows virtual objects and digital characters to be added to a portrait photograph with consistent illumination. As our inference runs in real-time on a smartphone, we enable realistic rendering and compositing of virtual objects into live video for augmented reality.
Deep Relightable Textures: Volumetric Performance Capture with Neural Rendering
Abhi Meka
Christian Haene
Peter Barnum
Philip Davidson
Daniel Erickson
Jonathan Taylor
Sofien Bouaziz
Wan-Chun Alex Ma
Ryan Overbeck
Thabo Beeler
Paul Debevec
Shahram Izadi
Christian Theobalt
Christoph Rhemann
SIGGRAPH Asia and TOG (2020)
The increasing demand for 3D content in augmented and virtual reality has motivated the development of volumetric performance capture systems such as the Light Stage. Recent advances are pushing free viewpoint relightable videos of dynamic human performances closer to photorealistic quality. However, despite significant efforts, these sophisticated systems are limited by reconstruction and rendering algorithms which do not fully model complex 3D structures and higher order light transport effects such as global illumination and sub-surface scattering. In this paper, we propose a system that combines traditional geometric pipelines with a neural rendering scheme to generate photorealistic renderings of dynamic performances under desired viewpoint and lighting. Our system leverages deep neural networks that model the classical rendering process to learn implicit features that represent the view-dependent appearance of the subject independent of the geometry layout, allowing for generalization to unseen subject poses and even novel subject identity. Detailed experiments and comparisons demonstrate the efficacy and versatility of our method to generate high-quality results, significantly outperforming the existing state-of-the-art solutions.
GeLaTO: Generative Latent Textured Objects
Ricardo Martin Brualla
Sofien Bouaziz
Matthew Brown
Dan B Goldman
European Conference on Computer Vision (2020)
Accurate modeling of 3D objects exhibiting transparency, reflections and thin structures is an extremely challenging problem. Inspired by billboards and geometric proxies used in computer graphics, this paper proposes Generative Latent Textured Objects (GeLaTO), a compact representation that combines a set of coarse shape proxies defining low frequency geometry with learned neural textures, to encode both medium and fine scale geometry as well as view-dependent appearance. To generate the proxies' textures, we learn a joint latent space allowing category-level appearance and geometry interpolation. The proxies are independently rasterized with their corresponding neural texture and composited using a U-Net, which generates an output photorealistic image including an alpha map. We demonstrate the effectiveness of our approach by reconstructing complex objects from a sparse set of views. We show results on a dataset of real images of eyeglasses frames, which are particularly challenging to reconstruct with classical methods. We also demonstrate that these coarse proxies can be handcrafted when the underlying object geometry is easy to model, like eyeglasses, or generated using a neural network for more complex categories, such as cars.
State of the Art on Neural Rendering
Ayush Tewari
Christian Theobalt
Dan B Goldman
Eli Shechtman
Gordon Wetzstein
Jason Saragih
Jun-Yan Zhu
Justus Thies
Kalyan Sunkavalli
Maneesh Agrawala
Matthias Niessner
Michael Zollhöfer
Ohad Fried
Ricardo Martin Brualla
Stephen Lombardi
Tomas Simon
Vincent Sitzmann
Computer Graphics Forum (2020)
The efficient rendering of photo-realistic virtual worlds is a long standing effort of computer graphics. Over the last few years, rapid orthogonal progress in deep generative models has been made by the computer vision and machine learning communities leading to powerful algorithms to synthesize and edit images. Neural rendering approaches are a hybrid of both of these efforts that combine physical knowledge, such as a differentiable renderer, with learned components for controllable image synthesis. Nowadays, neural rendering is employed for solving a steadily growing number of computer graphics and vision problems. This state-of-the-art report summarizes the recent trends and applications of neural rendering. We focus on approaches that combine classic computer graphics techniques with deep generative models to obtain controllable and photo-realistic outputs. Starting with an overview of the underlying computer graphics and machine learning concepts, we discuss critical aspects of neural rendering approaches. Specifically, we are dealing with the type of control, i.e., how the control is provided, which parts of the pipeline are learned, explicit vs. implicit control, generalization, and stochastic vs. deterministic synthesis. The second half of this state-of-the-art report is focused on the many important use cases for the described algorithms such as novel view synthesis, semantic photo manipulation, facial and body reenactment, re-lighting, free-viewpoint video, and the creation of photo-realistic avatars for virtual and augmented reality telepresence. Finally, we conclude with a discussion of the social implications of such technology and investigate open research problems.
The Relightables: Volumetric Performance Capture of Humans with Realistic Relighting
Kaiwen Guo
Peter Lincoln
Philip Davidson
Xueming Yu
Matt Whalen
Geoff Harvey
Jason Dourgarian
Danhang Tang
Anastasia Tkach
Emily Cooper
Mingsong Dou
Graham Fyffe
Christoph Rhemann
Jonathan Taylor
Paul Debevec
Shahram Izadi
SIGGRAPH Asia (2019) (to appear)
We present "The Relightables", a volumetric capture system for photorealistic and high quality relightable full-body performance capture. While significant progress has been made on volumetric capture systems, focusing on 3D geometric reconstruction with high resolution textures, much less work has been done to recover photometric properties needed for relighting. Results from such systems lack high-frequency details and the subject's shading is prebaked into the texture. In contrast, a large body of work has addressed relightable acquisition for image-based approaches, which photograph the subject under a set of basis lighting conditions and recombine the images to show the subject as they would appear in a target lighting environment. However, to date, these approaches have not been adapted for use in the context of a high-resolution volumetric capture system. Our method combines this ability to realistically relight humans for arbitrary environments, with the benefits of free-viewpoint volumetric capture and new levels of geometric accuracy for dynamic performances. Our subjects are recorded inside a custom geodesic sphere outfitted with 331 custom color LED lights, an array of high-resolution cameras, and a set of custom high-resolution depth sensors. Our system innovates in multiple areas: First, we designed a novel active depth sensor to capture 12.4MP depth maps, which we describe in detail. Second, we show how to design a hybrid geometric and machine learning reconstruction pipeline to process the high resolution input and output a volumetric video. Third, we generate temporally consistent reflectance maps for dynamic performers by leveraging the information contained in two alternating color gradient illumination images acquired at 60Hz. Multiple experiments, comparisons, and applications show that The Relightables significantly improves upon the level of realism in placing volumetrically captured human performances into arbitrary CG scenes.
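The image-based relighting idea referenced in this abstract, recombining photographs taken under basis lighting conditions, relies on the linearity of light transport: a relit image is a weighted sum of the basis images. The sketch below illustrates only that general principle and is not the system's pipeline; the function name `relight`, the nested-list image layout, and the per-light RGB weights are assumptions made for the example.

```python
def relight(basis_images, weights):
    """Recombine basis-lit images into one relit image.

    Hypothetical helper: basis_images[i] is the subject lit by basis light i,
    stored as rows of (r, g, b) tuples; weights[i] is an (r, g, b) tuple giving
    the target environment's intensity for that basis light.
    """
    if not basis_images or len(basis_images) != len(weights):
        raise ValueError("need at least one basis image and one weight per image")
    height = len(basis_images[0])
    width = len(basis_images[0][0])
    # Start from a black image and accumulate each light's contribution.
    relit = [[(0.0, 0.0, 0.0)] * width for _ in range(height)]
    for image, w in zip(basis_images, weights):
        for y in range(height):
            for x in range(width):
                acc = relit[y][x]
                px = image[y][x]
                # Light transport is linear, so contributions simply add up.
                relit[y][x] = tuple(acc[c] + w[c] * px[c] for c in range(3))
    return relit
```

In practice the weights would be sampled from the target lighting environment (for example an HDR map) in the direction of each basis light; how that sampling is done is outside the scope of this sketch.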
Volumetric Capture of Humans with a Single RGBD Camera via Semi-Parametric Learning
Anastasia Tkach
Shuoran Yang
Pavel Pidlypenskyi
Jonathan Taylor
Ricardo Martin Brualla
George Papandreou
Philip Davidson
Cem Keskin
Shahram Izadi
CVPR (2019)
Volumetric (4D) performance capture is fundamental for AR/VR content generation. Whereas previous work in 4D performance capture has shown impressive results in studio settings, the technology is still far from being accessible to a typical consumer who, at best, might own a single RGBD sensor. Thus, in this work, we propose a method to synthesize free viewpoint renderings using a single RGBD camera. The key insight is to leverage previously seen "calibration" images of a given user to extrapolate what should be rendered in a novel viewpoint from the data available in the sensor. Given these past observations from multiple viewpoints, and the current RGBD image from a fixed view, we propose an end-to-end framework that fuses both these data sources to generate novel renderings of the performer. We demonstrate that the method can produce high fidelity images, and handle extreme changes in subject pose and camera viewpoints. We also show that the system generalizes to performers not seen in the training data. We run exhaustive experiments demonstrating the effectiveness of the proposed semi-parametric model (i.e. calibration images available to the neural network) compared to other state of the art machine learned solutions. Further, we compare the method with more traditional pipelines that employ multi-view capture. We show that our framework is able to achieve compelling results, with substantially less infrastructure than previously required.