Adarsh Kowdle

I am a Senior Staff Software Engineer and R&D group Manager on Google's Augmented Reality team, leading the efforts around geometric and human perception. I work on end-to-end solutions from research to product at the intersection of real-time computer vision, geometric/human sensing and applied machine learning, such as the ARCore Depth API and Relightables. Previously at Google, I was the Hardware/Systems Lead for uDepth: real-time active depth sensing on Pixel 4 that powers Face Unlock and computational photography use cases such as bokeh. My areas of interest are computer vision and machine learning with a focus on real-time applications.

Previously, I was a Senior Scientist and part of the founding team at perceptiveIO, where I developed computer vision and machine learning algorithms for 3D sensing, visual recognition and human-computer interaction. Prior to this, I spent 3 years at Microsoft as a Senior SDE / Researcher in the Applied Vision and Imaging Team, where I worked on Surface Hub among other projects. I also worked with the Interactive 3D Technologies group at Microsoft Research in Redmond for 6 months on projects such as Holoportation.

I graduated with a PhD in Electrical and Computer Engineering from Cornell University in July 2013. I was advised by Prof. Tsuhan Chen. My thesis focused on interactive computer vision algorithms and image-based modeling: putting the user in the loop intelligently by leveraging the power of the automatic algorithm.

Google Scholar Page
Authored Publications
    Rapsai: Accelerating Machine Learning Prototyping of Multimedia Applications through Visual Programming
    Ram Iyengar
    Na Li
    Jing Jin
    Michelle Carney
    Scott Joseph Miles
    Maria Kleiner
    Xiuxiu Yuan
    Anuva Kulkarni
    Xingyu “Bruce” Liu
    Ahmed K Sabie
    Ping Yu
    Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI), ACM
    In recent years, there has been a proliferation of multimedia applications that leverage machine learning (ML) for interactive experiences. Prototyping ML-based applications is, however, still challenging, given complex workflows that are not ideal for design and experimentation. To better understand these challenges, we conducted a formative study with seven ML practitioners to gather insights about common ML evaluation workflows. This study helped us derive six design goals, which informed Rapsai, a visual programming platform for rapid and iterative development of end-to-end ML-based multimedia applications. Rapsai is based on a node-graph editor to facilitate interactive characterization and visualization of ML model performance. Rapsai streamlines end-to-end prototyping with interactive data augmentation and model comparison capabilities in its no-coding environment. Our evaluation of Rapsai in four real-world case studies (N=15) suggests that practitioners can accelerate their workflow, make more informed decisions, analyze strengths and weaknesses, and holistically evaluate model behavior with real-world input.
    Experiencing Visual Blocks for ML: Visual Prototyping of AI Pipelines
    Na Li
    Jing Jin
    Michelle Carney
    Jun Jiang
    Xiuxiu Yuan
    Kristen Wright
    Mark Sherwood
    Jason Mayes
    Lin Chen
    Jingtao Zhou
    Zhongyi Zhou
    Ping Yu
    Ram Iyengar
    ACM (2023) (to appear)
    We demonstrate Visual Blocks for ML, a visual programming platform that facilitates rapid prototyping of ML-based multimedia applications. As the public version of Rapsai, Visual Blocks for ML further integrates large language models and custom APIs into the platform. In this demonstration, we will showcase how to build interactive AI pipelines in a few drag-and-drops, how to perform interactive data augmentation, and how to integrate pipelines into Colabs. In addition, we demonstrate a wide range of community-contributed pipelines in Visual Blocks for ML, covering various aspects including interactive graphics, chains of large language models, computer vision, and multi-modal applications. Finally, we encourage students, designers, and ML practitioners to contribute ML pipelines through https://github.com/google/visualblocks/tree/main/pipelines to inspire creative use cases. Visual Blocks for ML is available at http://visualblocks.withgoogle.com.
    Experiencing Rapid Prototyping of Machine Learning Based Multimedia Applications in Rapsai
    Na Li
    Jing Jin
    Michelle Carney
    Xiuxiu Yuan
    Ping Yu
    Ram Iyengar
    CHI EA '23: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, ACM, 448:1-4
    We demonstrate Rapsai, a visual programming platform that aims to streamline the rapid and iterative development of end-to-end machine learning (ML)-based multimedia applications. Rapsai features a node-graph editor that enables interactive characterization and visualization of ML model performance, which facilitates the understanding of how the model behaves in different scenarios. Moreover, the platform streamlines end-to-end prototyping by providing interactive data augmentation and model comparison capabilities within a no-coding environment. Our demonstration showcases the versatility of Rapsai through several use cases, including virtual background, visual effects with depth estimation, and audio denoising. The implementation of Rapsai is intended to support ML practitioners in streamlining their workflow, making data-driven decisions, and comprehensively evaluating model behavior with real-world input.
    DepthLab: Real-time 3D Interaction with Depth Maps for Mobile Augmented Reality
    Maksym Dzitsiuk
    Luca Prasso
    Ivo Duarte
    Jason Dourgarian
    Joao Afonso
    Jose Pascoal
    Josh Gladstone
    Nuno Moura e Silva Cruces
    Shahram Izadi
    Konstantine Nicholas John Tsotsos
    Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, ACM (2020), pp. 829-843
    Mobile devices with passive depth sensing capabilities are ubiquitous, and recently active depth sensors have become available on some tablets and VR/AR devices. Although real-time depth data is accessible, its rich value to mainstream AR applications has been sorely under-explored. Adoption of depth-based UX has been impeded by the complexity of performing even simple operations with raw depth data, such as detecting intersections or constructing meshes. In this paper, we introduce DepthLab, a software library that encapsulates a variety of depth-based UI/UX paradigms, including geometry-aware rendering (occlusion, shadows), surface interaction behaviors (physics-based collisions, avatar path planning), and visual effects (relighting, depth-of-field effects). We break down depth usage into localized depth, surface depth, and dense depth, and describe our real-time algorithms for interaction and rendering tasks. We present the design process, system, and components of DepthLab to streamline and centralize the development of interactive depth features. We have open-sourced our software to external developers, conducted performance evaluation, and discussed how DepthLab can accelerate the workflow of mobile AR designers and developers. We envision that DepthLab may help mobile AR developers amplify their prototyping efforts, empowering them to unleash their creativity and effortlessly integrate depth into mobile AR experiences.
    Experiencing Real-time 3D Interaction with Depth Maps for Mobile Augmented Reality in DepthLab
    Maksym Dzitsiuk
    Luca Prasso
    Ivo Duarte
    Jason Dourgarian
    Joao Afonso
    Jose Pascoal
    Josh Gladstone
    Nuno Moura e Silva Cruces
    Shahram Izadi
    Konstantine Nicholas John Tsotsos
    Adjunct Publication of the 33rd Annual ACM Symposium on User Interface Software and Technology, ACM (2020), pp. 108-110
    We demonstrate DepthLab, a wide range of experiences using the ARCore Depth API that allows users to detect the shape and depth in the physical environment with a mobile phone. DepthLab encapsulates a variety of depth-based UI/UX paradigms, including geometry-aware rendering (occlusion, shadows, texture decals), surface interaction behaviors (physics, collision detection, avatar path planning), and visual effects (relighting, 3D-anchored focus and aperture effects, 3D photos). We have open-sourced our software at https://github.com/googlesamples/arcore-depth-lab to facilitate future research and development in depth-aware mobile AR experiences. With DepthLab, we aim to help mobile developers to effortlessly integrate depth into their AR experiences and amplify the expression of their creative vision.
    Deep Reflectance Fields - High-Quality Facial Reflectance Field Inference from Color Gradient Illumination
    Abhi Meka
    Christian Haene
    Michael Zollhöfer
    Graham Fyffe
    Xueming Yu
    Jason Dourgarian
    Peter Denny
    Sofien Bouaziz
    Peter Lincoln
    Matt Whalen
    Geoff Harvey
    Jonathan Taylor
    Shahram Izadi
    Paul Debevec
    Christian Theobalt
    Julien Valentin
    Christoph Rhemann
    SIGGRAPH (2019)
    Photo-realistic relighting of human faces is a highly sought after feature with many applications ranging from visual effects to truly immersive virtual experiences. Despite tremendous technological advances in the field, humans are often capable of distinguishing real faces from synthetic renders. Photo-realistically relighting any human face is indeed a challenge with many difficulties, ranging from modelling sub-surface scattering and blood flow to estimating the interaction between light and individual strands of hair. We introduce the first system that combines the ability to deal with dynamic performances with the realism of 4D reflectance fields, enabling photo-realistic relighting of non-static faces. The core of our method consists of a deep neural network that is able to predict full 4D reflectance fields from two images captured under spherical gradient illumination. Extensive experiments not only show that two images under spherical gradient illumination can be easily captured in real time, but also that these particular images contain all the information needed to estimate the full reflectance field, including specularities and high frequency details. Finally, side by side comparisons demonstrate that the proposed system outperforms the current state-of-the-art in terms of realism and speed.
    The Relightables: Volumetric Performance Capture of Humans with Realistic Relighting
    Kaiwen Guo
    Peter Lincoln
    Philip Davidson
    Xueming Yu
    Matt Whalen
    Geoff Harvey
    Jason Dourgarian
    Danhang Tang
    Anastasia Tkach
    Emily Cooper
    Mingsong Dou
    Graham Fyffe
    Christoph Rhemann
    Jonathan Taylor
    Paul Debevec
    Shahram Izadi
    SIGGRAPH Asia (2019) (to appear)
    We present "The Relightables", a volumetric capture system for photorealistic and high quality relightable full-body performance capture. While significant progress has been made on volumetric capture systems, focusing on 3D geometric reconstruction with high resolution textures, much less work has been done to recover photometric properties needed for relighting. Results from such systems lack high-frequency details and the subject's shading is prebaked into the texture. In contrast, a large body of work has addressed relightable acquisition for image-based approaches, which photograph the subject under a set of basis lighting conditions and recombine the images to show the subject as they would appear in a target lighting environment. However, to date, these approaches have not been adapted for use in the context of a high-resolution volumetric capture system. Our method combines this ability to realistically relight humans for arbitrary environments, with the benefits of free-viewpoint volumetric capture and new levels of geometric accuracy for dynamic performances. Our subjects are recorded inside a custom geodesic sphere outfitted with 331 custom color LED lights, an array of high-resolution cameras, and a set of custom high-resolution depth sensors. Our system innovates in multiple areas: First, we designed a novel active depth sensor to capture 12.4MP depth maps, which we describe in detail. Second, we show how to design a hybrid geometric and machine learning reconstruction pipeline to process the high resolution input and output a volumetric video. Third, we generate temporally consistent reflectance maps for dynamic performers by leveraging the information contained in two alternating color gradient illumination images acquired at 60Hz. Multiple experiments, comparisons, and applications show that The Relightables significantly improves upon the level of realism in placing volumetrically captured human performances into arbitrary CG scenes.
    LookinGood: Enhancing Performance Capture with Real-Time Neural Re-Rendering
    Ricardo Martin Brualla
    Shuoran Yang
    Pavel Pidlypenskyi
    Jonathan Taylor
    Julien Valentin
    Sameh Khamis
    Philip Davidson
    Anastasia Tkach
    Peter Lincoln
    Christoph Rhemann
    Cem Keskin
    Steve Seitz
    Shahram Izadi
    SIGGRAPH Asia (2018)
    Motivated by augmented and virtual reality applications such as telepresence, there has been a recent focus on real-time performance capture of humans under motion. However, given the real-time constraint, these systems often suffer from artifacts in geometry and texture such as holes and noise in the final rendering, poor lighting, and low-resolution textures. We take the novel approach of augmenting such real-time performance capture systems with a deep architecture that takes a rendering from an arbitrary viewpoint, and jointly performs completion, super resolution, and denoising of the imagery in real-time. We call this approach neural (re-)rendering, and our live system "LookinGood". Our deep architecture is trained to produce high resolution and high quality images from a coarse rendering in real-time. First, we propose a self-supervised training method that does not require manual ground-truth annotation. We contribute a specialized reconstruction error that uses semantic information to focus on relevant parts of the subject, e.g. the face. We also introduce a salient reweighing scheme of the loss function that is able to discard outliers. We specifically design the system for virtual and augmented reality headsets where the consistency between the left and right eye plays a crucial role in the final user experience. Finally, we generate temporally stable results by explicitly minimizing the difference between two consecutive frames. We tested the proposed system in two different scenarios: the first involving a single RGB-D sensor and upper body reconstruction of an actor, the second consisting of full body 360 degree capture. Through extensive experimentation, we demonstrate how our system generalizes across unseen sequences and subjects.
    Real-time Compression and Streaming of 4D Performances
    Danhang Tang
    Mingsong Dou
    Peter Lincoln
    Philip Davidson
    Kaiwen Guo
    Jonathan Taylor
    Cem Keskin
    Sofien Bouaziz
    Shahram Izadi
    ACM Transactions on Graphics (2018)
    We introduce a realtime compression architecture for 4D performance capture that is two orders of magnitude faster than current state-of-the-art techniques, yet achieves comparable visual quality and bitrate. We note how much of the algorithmic complexity in traditional 4D compression arises from the necessity to encode geometry in an explicit model (i.e. a triangle mesh). In contrast, we propose an encoder that leverages an implicit model to represent the observed geometry and its changes through time.
    SOS: Stereo Matching in O(1) with Slanted Support Windows
    Vladimir Tankovich
    Michael John Schoenberg
    Christoph Rhemann
    Mirko Schmidt
    Maksym Dzitsiuk
    Julien Valentin
    Shahram Izadi
    IROS (2018)
    Depth cameras have accelerated research in many areas of computer vision. Most triangulation-based depth cameras, whether structured light systems like the Kinect or active (assisted) stereo systems, are based on the principle of stereo matching. Depth from stereo is an active research topic dating back 30 years. Despite recent advances, algorithms usually trade off accuracy for speed. In particular, efficient methods rely on fronto-parallel assumptions to reduce the search space and keep computation low. We present SOS (Slanted O(1) Stereo), the first algorithm capable of leveraging slanted support windows without sacrificing speed or accuracy. We use an active stereo configuration, where an illuminator textures the scene. Under this setting, local methods, such as PatchMatch Stereo, obtain state of the art results by jointly estimating disparities and slant, but at a large computational cost. We observe that these methods typically exploit local smoothness to simplify their initialization strategies. Our key insight is that local smoothness can in fact be used to amortize the computation not only within initialization, but across the entire stereo pipeline. Building on these insights, we propose a novel hierarchical initialization that is able to efficiently perform search over disparity and slants. We then show how this structure can be leveraged to provide high quality depth maps. Extensive quantitative evaluations demonstrate that the proposed technique yields significantly more precise results than current state of the art, but at a fraction of the computational cost. Our prototype implementation runs at 4000 fps on modern GPU architectures.
    Depth from motion for smartphone AR
    Julien Valentin
    Neal Wadhwa
    Max Dzitsiuk
    Michael John Schoenberg
    Vivek Verma
    Ambrus Csaszar
    Ivan Dryanovski
    Joao Afonso
    Jose Pascoal
    Konstantine Nicholas John Tsotsos
    Mira Angela Leung
    Mirko Schmidt
    Sameh Khamis
    Vladimir Tankovich
    Shahram Izadi
    Christoph Rhemann
    ACM Transactions on Graphics (2018)
    Augmented reality (AR) for smartphones has matured from a technology for early adopters, available only on select high-end phones, to one that is truly available to the general public. One of the key breakthroughs has been in low-compute methods for six-degree-of-freedom (6DoF) tracking on phones using only the existing hardware (camera and inertial sensors). 6DoF tracking is the cornerstone of smartphone AR, allowing virtual content to be precisely locked on top of the real world. However, to really give users the impression of believable AR, one requires mobile depth. Without depth, even simple effects such as a virtual object being correctly occluded by the real world are impossible. However, requiring a mobile depth sensor would severely restrict the access to such features. In this article, we provide a novel pipeline for mobile depth that supports a wide array of mobile phones, and uses only the existing monocular color sensor. Through several technical contributions, we provide the ability to compute low latency dense depth maps using only a single CPU core of a wide range of (medium-high) mobile phones. We demonstrate the capabilities of our approach on high-level AR applications including real-time navigation and shopping.
    StereoNet: Guided Hierarchical Refinement for Edge-Aware Depth Prediction
    Sameh Khamis
    Christoph Rhemann
    Julien Valentin
    Shahram Izadi
    European Conference on Computer Vision (2018)
    This paper presents StereoNet, the first end-to-end deep architecture for real-time stereo matching that runs at 60fps on an NVidia Titan X, producing high-quality, edge-preserved, quantization-free depth maps. A key insight of this paper is that the network achieves a sub-pixel matching precision that is an order of magnitude higher than those of traditional stereo matching approaches. This allows us to achieve real-time performance by using a very low resolution cost volume that encodes all the information needed to achieve high depth precision. Spatial precision is achieved by employing a learned edge-aware upsampling function. Our model uses a Siamese network to extract features from the left and right image. A first estimate of the disparity is computed in a very low resolution cost volume, then hierarchically the model re-introduces high-frequency details through a learned upsampling function that uses compact pixel-to-pixel refinement networks. Leveraging color input as a guide, this function is capable of producing high-quality edge-aware output. We achieve compelling results on multiple benchmarks, showing how the proposed method offers extreme flexibility at an acceptable computational budget.
    The Need 4 Speed in Real-Time Dense Visual Tracking
    Christoph Rhemann
    Jonathan Taylor
    Philip Davidson
    Mingsong Dou
    Kaiwen Guo
    Cem Keskin
    Sameh Khamis
    Danhang Tang
    Vladimir Tankovich
    Julien Valentin
    Shahram Izadi
    SIGGRAPH Asia (2018)
    The advent of consumer depth cameras has incited the development of a new cohort of algorithms tackling challenging computer vision problems. The primary reason is that depth provides direct geometric information that is largely invariant to texture and illumination. As such, substantial progress has been made in human and object pose estimation, 3D reconstruction and simultaneous localization and mapping. Most of these algorithms naturally benefit from the ability to accurately track the pose of an object or scene of interest from one frame to the next. However, commercially available depth sensors (typically running at 30fps) can allow for large inter-frame motions to occur that make such tracking problematic. A high frame rate depth camera would thus greatly ameliorate these issues, and further increase the tractability of these computer vision problems. Nonetheless, the depth accuracy of recent systems for high-speed depth estimation [Fanello et al. 2017b] can degrade at high frame rates. This is because the active illumination employed produces a low SNR and thus a high exposure time is required to obtain a dense accurate depth image. Furthermore, in the presence of rapid motion, longer exposure times produce artifacts due to motion blur, and necessitate a lower frame rate that introduces large inter-frame motion that often yields tracking failures. In contrast, this paper proposes a novel combination of hardware and software components that avoids the need to compromise between a dense accurate depth map and a high frame rate. We document the creation of a full 3D capture system for high speed and quality depth estimation, and demonstrate its advantages in a variety of tracking and reconstruction tasks. We extend the state of the art active stereo algorithm presented in Fanello et al. [2017b] by adding a space-time feature in the matching phase. We also propose a machine learning based depth refinement step that is an order of magnitude faster than traditional postprocessing methods. We quantitatively and qualitatively demonstrate the benefits of the proposed algorithms in the acquisition of geometry in motion. Our pipeline executes in 1.1ms leveraging modern GPUs and off-the-shelf cameras and illumination components. We show how the sensor can be employed in many different applications, from [non-]rigid reconstructions to hand/face tracking. Further, we show many advantages over existing state of the art depth camera technologies beyond framerate, including latency, motion artifacts, multi-path errors, and multi-sensor interference.
    Real time non-rigid reconstruction pipelines are extremely computationally expensive and easily saturate the highest end GPUs currently available. This requires careful strategic choices to be made about a set of highly interconnected parameters that divide up the limited compute. Offline systems, however, prove the value of increasing voxel resolution, more iterations, and higher frame rates. To this end, we demonstrate a set of remarkably simple, but effective modifications to these algorithms that significantly reduce the average per-frame computation cost allowing these parameters to be increased. Specifically, we divide the depth stream into sub-frames and fusion-frames, disabling both model accumulation (fusion) and non-rigid alignment (model tracking) on the former. Instead, we efficiently track point correspondences across neighboring sub-frames. We then leverage these correspondences to initialize the standard non-rigid alignment to a fusion-frame where data can then be accumulated into the model. As a result, compute resources in the modified non-rigid reconstruction pipeline can be immediately re-purposed. Finally, we leverage recent high framerate depth algorithms to build a novel “twin” sensor consisting of a low-res/high-fps sub-frame camera and a second low-fps/high-res fusion camera.
    UltraFast 3D Sensing, Reconstruction and Understanding of People, Objects, and Environments
    Anastasia Tkach
    Christine Kaeser-Chen
    Christoph Rhemann
    Jonathan Taylor
    Julien Valentin
    Kaiwen Guo
    Mingsong Dou
    Sameh Khamis
    Shahram Izadi
    Sofien Bouaziz
    Thomas Funkhouser
    Yinda Zhang
    This is a set of slide decks presenting a full tutorial on 3D capture and reconstruction, with high-level applications in VR and AR. The slides are to be hosted on the tutorial website: https://augmentedperception.github.io/cvpr18/
    ActiveStereoNet: Unsupervised End-to-End Learning for Active Stereo Systems
    Yinda Zhang
    Sameh Khamis
    Christoph Rhemann
    Julien Valentin
    Vladimir Tankovich
    Michael Schoenberg
    Shahram Izadi
    European Conference on Computer Vision (2018)
    In this paper we present ActiveStereoNet, the first deep learning solution for active stereo systems. Due to the lack of ground truth, our method is fully self-supervised, yet it produces precise depth with a subpixel precision of 1/30th of a pixel; it does not suffer from the common over-smoothing issues; it preserves the edges; and it explicitly handles occlusions. We introduce a novel reconstruction loss that is more robust to noise and texture-less patches, and is invariant to illumination changes. The proposed loss is optimized using a window-based cost aggregation with an adaptive support weight scheme. This cost aggregation is edge-preserving and smooths the loss function, which is key to allowing the network to reach compelling results. Finally, we show how the task of predicting invalid regions, such as occlusions, can be trained end-to-end without ground truth. This component is crucial to reduce blur and particularly improves predictions along depth discontinuities. Extensive quantitative and qualitative evaluations on real and synthetic data demonstrate state of the art results in many challenging scenes.