Sean Fanello

I am a Research Scientist and Manager at Google, where I lead efforts to solve real-world human perception tasks, often relying on performance capture and neural rendering to train deep learning models that generalize to in-the-wild applications. My research interests include 3D performance capture, photorealistic rendering, neural rendering, relighting, and viewpoint synthesis. Previously, I was a Senior Scientist and a Founding Team Member at perceptiveIO, Inc., where I developed computer vision and machine learning algorithms for 3D sensing, visual recognition, and human-computer interaction. Prior to that, I was a Post-Doc Researcher in the Interactive 3D Technologies (I3D) group at Microsoft Research Redmond, where I substantially contributed to the HoloLens 3D sensing capabilities. I was also one of the main contributors to the Holoportation project. I obtained my PhD in Robotics, Cognition and Interaction Technologies at the Italian Institute of Technology in collaboration with the University of Genoa in 2013. During my PhD I developed computer vision and machine learning techniques for the iCub humanoid robot. In 2010 I completed my Master’s Degree in Computer Engineering at Sapienza University of Rome, with a specialization in Artificial Intelligence and Pattern Recognition. Personal website: http://seanfanello.it
Authored Publications
    Sandwiched Compression: Repurposing Standard Codecs with Neural Network Wrappers
    Phil A. Chou
    Hugues Hoppe
    Danhang Tang
    Jonathan Taylor
    Philip Davidson
    arXiv:2402.05887 (2024)
    We propose sandwiching standard image and video codecs between pre- and post-processing neural networks. The networks are jointly trained through a differentiable codec proxy to minimize a given rate-distortion loss. This sandwich architecture not only improves the standard codec’s performance on its intended content, it can effectively adapt the codec to other types of image/video content and to other distortion measures. Essentially, the sandwich learns to transmit “neural code images” that optimize overall rate-distortion performance even when the overall problem is well outside the scope of the codec’s design. Through a variety of examples, we apply the sandwich architecture to sources with different numbers of channels, higher resolution, higher dynamic range, and perceptual distortion measures. The results demonstrate substantial improvements (up to 9 dB gains or up to 3 adaptations. We derive VQ equivalents for the sandwich, establish optimality properties, and design differentiable codec proxies approximating current standard codecs. We further analyze model complexity, visual quality under perceptual metrics, as well as sandwich configurations that offer interesting potentials in image/video compression and streaming.
    Multi-Camera Lighting Estimation for Photorealistic Front-Facing Mobile AR
    Yiqin Zhao
    Tian Guo
    Association for Computing Machinery, New York, NY, USA (2023), 68–73
    Lighting estimation plays an important role in virtual object composition, including mobile augmented reality (AR) applications. Prior work often targets recovering lighting from the physical environment to support photorealistic AR rendering. Because the common workflow is to use a backward-facing camera to capture the overlay of the physical world and virtual objects, we refer to this usage pattern as backward-facing AR. However, existing methods often fall short of supporting emerging front-facing virtual try-on applications where a mobile user leverages a front-facing camera to explore the effect of various products, e.g., glasses or hats, of different styles. This lack of support can be attributed to the unique challenges of obtaining 360° HDR environment maps, an ideal format of lighting representation, from the front-facing camera. In this paper, we propose to leverage a dual-camera streaming setup (front- and backward-facing) to perform multi-view lighting estimation. Our approach results in improved rendering quality and visually coherent AR try-on experiences. Our contributions include energy-conserving data capture, high-quality environment map generation, and parametric directional light estimation.
    Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos
    Ziqian Bai
    Danhang "Danny" Tang
    Di Qiu
    Abhimitra Meka
    Mingsong Dou
    Ping Tan
    Thabo Beeler
    2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE
    We propose a method to learn a high-quality implicit 3D head avatar from a monocular RGB video captured in the wild. The learnt avatar is driven by a parametric face model to achieve user-controlled facial expressions and head poses. Our hybrid pipeline combines the geometry prior and dynamic tracking of a 3DMM with a neural radiance field to achieve fine-grained control and photorealism. To reduce over-smoothing and improve out-of-model expression synthesis, we propose to predict local features anchored on the 3DMM geometry. These learnt features are driven by 3DMM deformation and interpolated in 3D space to yield the volumetric radiance at a designated query point. We further show that using a Convolutional Neural Network in the UV space is critical in incorporating spatial context and producing representative local features. Extensive experiments show that we are able to reconstruct high-quality avatars, with more accurate expression-dependent details, good generalization to out-of-training expressions, and quantitatively superior renderings compared to other state-of-the-art approaches.
    Sandwiched Image Compression: Increasing the resolution and dynamic range of standard codecs
    Phil Chou
    Hugues Hoppe
    Danhang "Danny" Tang
    Philip Davidson
    2022 Picture Coding Symposium (PCS), IEEE (to appear)
    Given a standard image codec, we compress images that may have higher resolution and/or higher bit depth than allowed in the codec's specifications, by sandwiching the standard codec between a neural pre-processor (before the standard encoder) and a neural post-processor (after the standard decoder). Using a differentiable proxy for the standard codec, we design the neural pre- and post-processors to transport the high resolution (super-resolution, SR) or high bit depth (high dynamic range, HDR) images as lower resolution and lower bit depth images. The neural processors accomplish this with spatially coded modulation, which acts as watermarks to preserve the important image detail during compression. Experiments show that compared to conventional methods of transmitting high resolution or high bit depth through lower resolution or lower bit depth codecs, our sandwich architecture gains ~9 dB for SR images and ~3 dB for HDR images at the same rate over large test sets. We also observe significant gains in visual quality.
    Neural Light Transport for Relighting and View Synthesis
    Xiuming Zhang
    Yun-Ta Tsai
    Tiancheng Sun
    Tianfan Xue
    Philip Davidson
    Christoph Rhemann
    Paul Debevec
    Ravi Ramamoorthi
    ACM Transactions on Graphics, 40 (2021)
    The light transport (LT) of a scene describes how it appears under different lighting and viewing directions, and complete knowledge of a scene's LT enables the synthesis of novel views under arbitrary lighting. In this paper, we focus on image-based LT acquisition, primarily for human bodies within a light stage setup. We propose a semi-parametric approach to learn a neural representation of LT that is embedded in the space of a texture atlas of known geometric properties, and model all non-diffuse and global LT as residuals added to a physically-accurate diffuse base rendering. In particular, we show how to fuse previously seen observations of illuminants and views to synthesize a new image of the same scene under a desired lighting condition from a chosen viewpoint. This strategy allows the network to learn complex material effects (such as subsurface scattering) and global illumination, while guaranteeing the physical correctness of the diffuse LT (such as hard shadows). With this learned LT, one can relight the scene photorealistically with a directional light or an HDRI map, synthesize novel views with view-dependent effects, or do both simultaneously, all in a unified framework using a set of sparse, previously seen observations. Qualitative and quantitative experiments demonstrate that our neural LT (NLT) outperforms state-of-the-art solutions for relighting and view synthesis, without separate treatment for both problems that prior work requires.
    Multiresolution Deep Implicit Functions for 3D Shape Representation
    Zhang Chen
    Kyle Genova
    Sofien Bouaziz
    Christian Haene
    Cem Keskin
    Danhang "Danny" Tang
    ICCV (2021)
    We introduce Multiresolution Deep Implicit Functions (MDIF), a hierarchical representation that can recover fine details, while being able to perform more global operations such as shape completion. Our model represents a complex 3D shape with a hierarchy of latent grids, which can be decoded into different resolutions. Training is performed in an encoder-decoder manner, while decoder-only optimization is supported during inference, so the model can better generalize to novel objects, especially when performing shape completion. To the best of our knowledge, MDIF is the first model that can at the same time (1) reconstruct local detail; (2) perform decoder-only inference; (3) fulfill shape reconstruction and completion. We demonstrate superior performance against prior art in our experiments.
    We propose a novel system for portrait relighting and background replacement, which maintains high-frequency boundary details and accurately synthesizes the subject’s appearance as lit by novel illumination, thereby producing realistic composite images for any desired scene. Our technique includes foreground estimation via alpha matting, relighting, and compositing. We demonstrate that each of these stages can be tackled in a sequential pipeline without the use of priors (e.g. known background or known illumination) and with no specialized acquisition techniques, using only a single RGB portrait image and a novel, target HDR lighting environment as inputs. We train our model using relit portraits of subjects captured in a light stage computational illumination system, which records multiple lighting conditions, high quality geometry, and accurate alpha mattes. To perform realistic relighting for compositing, we introduce a novel per-pixel lighting representation in a deep learning framework, which explicitly models the diffuse and the specular components of appearance, producing relit portraits with convincingly rendered non-Lambertian effects like specular highlights. Multiple experiments and comparisons show the effectiveness of the proposed approach when applied to in-the-wild images.
    HumanGPS: Geodesic PreServing Feature for Dense Human Correspondence
    Danhang "Danny" Tang
    Mingsong Dou
    Kaiwen Guo
    Cem Keskin
    Sofien Bouaziz
    Ping Tan
    Computer Vision and Pattern Recognition 2021 (2021), pp. 8
    In this paper, we address the problem of building dense correspondences between human images under arbitrary camera viewpoints and body poses. Prior art either assumes small motion between frames or relies on local descriptors, which cannot handle large motion or visually ambiguous body parts, e.g. left vs. right hand. In contrast, we propose a deep learning framework that maps each pixel to a feature space, where the feature distances reflect the geodesic distances among pixels as if they were projected onto the surface of a 3D human scan. To this end, we introduce novel loss functions to push features apart according to their geodesic distances on the surface. Without any semantic annotation, the proposed embeddings automatically learn to differentiate visually similar parts and align different subjects into a unified feature space. Extensive experiments show that the learned embeddings can produce accurate correspondences between images, with remarkable generalization capabilities both within and across subjects.
    Sandwiched Image Compression: Wrapping Neural Networks Around a Standard Codec
    Phil Chou
    Hugues Hoppe
    Danhang "Danny" Tang
    Philip Davidson
    2021 IEEE International Conference on Image Processing (ICIP), IEEE, Anchorage, Alaska, pp. 3757-3761
    We sandwich a standard image codec between two neural networks: a preprocessor that outputs neural codes, and a postprocessor that reconstructs the image. The neural codes are compressed as ordinary images by the standard codec. Using differentiable proxies for both rate and distortion, we develop a rate-distortion optimization framework that trains the networks to generate neural codes that are efficiently compressible as images. This architecture not only improves rate-distortion performance for ordinary RGB images, but also enables efficient compression of alternative image types (such as normal maps of computer graphics) using standard image codecs. Results demonstrate the effectiveness and flexibility of neural processing in mapping a variety of input data modalities to the rigid structure of standard codecs. A surprising result is that the rate-distortion-optimized neural processing seamlessly learns to transport color images using a single-channel (grayscale) codec.
    Learning Illumination from Diverse Portraits
    Wan-Chun Alex Ma
    Christoph Rhemann
    Jason Dourgarian
    Paul Debevec
    SIGGRAPH Asia 2020 Technical Communications (2020)
    We present a learning-based technique for estimating high dynamic range (HDR), omnidirectional illumination from a single low dynamic range (LDR) portrait image captured under arbitrary indoor or outdoor lighting conditions. We train our model using portrait photos paired with their ground truth illumination. We generate a rich set of such photos by using a light stage to record the reflectance field and alpha matte of 70 diverse subjects in various expressions. We then relight the subjects using image-based relighting with a database of one million HDR lighting environments, compositing them onto paired high-resolution background imagery recorded during the lighting acquisition. We train the lighting estimation model using rendering-based loss functions and add a multi-scale adversarial loss to estimate plausible high frequency lighting detail. We show that our technique outperforms the state-of-the-art technique for portrait-based lighting estimation, and we also show that our method reliably handles the inherent ambiguity between overall lighting strength and surface albedo, recovering a similar scale of illumination for subjects with diverse skin tones. Our method allows virtual objects and digital characters to be added to a portrait photograph with consistent illumination. As our inference runs in real-time on a smartphone, we enable realistic rendering and compositing of virtual objects into live video for augmented reality.