Rahul Garg

Rahul Garg is a staff research scientist at Google, working on computer vision and computational photography. He received his PhD in Computer Science from the University of Washington in 2012 and his B. Tech. in Computer Science and Engineering from the Indian Institute of Technology (IIT), Delhi in 2007.

Authored Publications
    Defocus Map Estimation and Blur Removal from a Single Dual-Pixel Image
    Ioannis Gkioulekas
    Jiawen Chen
    Neal Wadhwa
    Pratul Srinivasan
    Shumian Xin
    Tianfan Xue
    International Conference on Computer Vision (2021)
    We present a method to simultaneously estimate an image's defocus map, i.e., the amount of defocus blur at each pixel, and remove the blur to recover a sharp all-in-focus image using only a single camera capture. Our method leverages data from dual-pixel sensors that are common on many consumer cameras. Though originally designed to assist camera autofocus, dual-pixel sensors have been used to separately recover both defocus maps and all-in-focus images. Past approaches have solved these two problems in isolation and often require large labeled datasets for supervised training. In contrast with those prior works, we show that the two problems are connected, model the optics of dual-pixel images, and set up an optimization problem to jointly solve for both. We use data captured with a consumer smartphone camera to demonstrate that after a one time calibration step, our approach improves upon past approaches for both defocus map estimation and blur removal, without any supervised training.
    How to train neural networks for flare removal
    Yicheng Wu
    Tianfan Xue
    Jiawen Chen
    Ashok Veeraraghavan
    ICCV (2021)
    When a camera is pointed at a strong light source, the resulting photograph may contain lens flare artifacts. Flares appear in a wide variety of patterns (halos, streaks, color bleeding, haze, etc.) and this diversity in appearance makes flare removal challenging. Existing analytical solutions make strong assumptions about the artifact’s geometry or brightness, and therefore only work well on a small subset of flares. Machine learning techniques have shown success in removing other types of artifacts, like reflections, but have not been widely applied to flare removal due to the lack of training data. To solve this problem, we explicitly model the optical causes of flare either empirically or using wave optics, and generate semi-synthetic pairs of flare-corrupted and clean images. This enables us to train neural networks to remove lens flare for the first time. Experiments show our data synthesis approach is critical for accurate flare removal, and that models trained with our technique generalize well to real lens flares across different scenes, lighting conditions, and cameras.
    Computational stereo has reached a high level of accuracy, but degrades in the presence of occlusions, repeated textures, and correspondence errors along edges. We present a novel approach based on neural networks for depth estimation that combines stereo from dual cameras with stereo from a dual-pixel sensor, which is increasingly common on consumer cameras. Our network uses a novel architecture to fuse these two sources of information and can overcome the above-mentioned limitations of pure binocular stereo matching. Our method provides a dense depth map with sharp edges, which is crucial for computational photography applications like synthetic shallow-depth-of-field or 3D Photos. Additionally, we avoid the inherent ambiguity due to the aperture problem in stereo cameras by designing the stereo baseline to be orthogonal to the dual-pixel baseline. We present experiments and comparisons with state-of-the-art approaches to show that our method offers a substantial improvement over previous works.
    Learning to Autofocus
    Charles Herrmann
    Richard Strong Bowen
    Neal Wadhwa
    Ramin Zabih
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
    Autofocus is an important task for digital cameras, yet current approaches often exhibit poor performance. We propose a learning-based approach to this problem, and provide a realistic dataset of sufficient size for effective learning. Our dataset is labeled with per-pixel depths obtained from multi-view stereo, following [9]. Using this dataset, we apply modern deep classification models and an ordinal regression loss to obtain an efficient learning-based autofocus technique. We demonstrate that our approach provides a significant improvement compared with previous learned and non-learned methods: our model reduces the mean absolute error by a factor of 3.6 over the best comparable baseline algorithm. Our dataset and code are publicly available.
    Zoom-to-Inpaint: Image Inpainting with High Frequency Details
    Huiwen Chang
    Kfir Aberman
    Munchurl Kim
    Neal Wadhwa
    Nikhil Karnad
    Nori Kanazawa
    Soo Ye Kim
    arXiv (2020)
    Although deep learning has enabled a huge leap forward in image inpainting, current methods are often unable to synthesize realistic high-frequency details. In this paper, we propose applying super resolution to coarsely reconstructed outputs, refining them at high resolution, and then downscaling the output to the original resolution. By introducing high-resolution images to the refinement network, our framework is able to reconstruct finer details that are usually smoothed out due to spectral bias - the tendency of neural networks to reconstruct low frequencies better than high frequencies. To assist training the refinement network on large upscaled holes, we propose a progressive learning technique in which the size of the missing regions increases as training progresses. Our zoom-in, refine and zoom-out strategy, combined with high-resolution supervision and progressive learning, constitutes a framework-agnostic approach for enhancing high-frequency details that can be applied to other inpainting methods. We provide qualitative and quantitative evaluations along with an ablation analysis to show the effectiveness of our approach, which outperforms state-of-the-art inpainting methods.
    Wireless Software Synchronization of Multiple Distributed Cameras
    Sam Ansari
    Neal Wadhwa
    Jiawen Chen
    Computational Photography (ICCP), 2019 IEEE International Conference on
    We present a method for precisely time-synchronizing the capture of image sequences from a collection of smartphone cameras connected over WiFi. Our method is entirely software-based, has only modest hardware requirements, and achieves an accuracy of less than 250 microseconds on unmodified commodity hardware. It does not use image content and synchronizes cameras prior to capture. The algorithm operates in two stages. In the first stage, we designate one device as the leader and synchronize each client device's clock to it by estimating network delay. Once clocks are synchronized, the second stage initiates continuous image streaming, estimates the relative phase of image timestamps between each client and the leader, and shifts the streams into alignment. We quantitatively validate our results on a multi-camera rig imaging a high-precision LED array and qualitatively demonstrate significant improvements to multi-view stereo depth estimation and stitching of dynamic scenes. We plan to open-source an Android implementation of our system 'libsoftwaresync', potentially inspiring new types of collective capture applications.
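The first stage described in the abstract above, synchronizing a client clock to a leader by estimating network delay, can be illustrated with a generic NTP-style exchange. This is a minimal sketch and not the paper's implementation: the `PingSample` type, the `estimate_offset` function, the symmetric-delay assumption, and the choice to trust the sample with the smallest round trip are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PingSample:
    """One request/reply exchange (hypothetical structure, times in seconds)."""
    t0: float  # client clock when the request was sent
    t1: float  # leader clock when the request arrived
    t2: float  # leader clock when the reply was sent
    t3: float  # client clock when the reply arrived

    @property
    def round_trip(self) -> float:
        # Time spent on the network, excluding the leader's processing time.
        return (self.t3 - self.t0) - (self.t2 - self.t1)

    @property
    def offset(self) -> float:
        # Estimate of (leader clock - client clock), assuming the outbound
        # and return network delays are equal.
        return ((self.t1 - self.t0) + (self.t2 - self.t3)) / 2.0


def estimate_offset(samples: Sequence[PingSample]) -> float:
    """Return the offset from the exchange with the shortest round trip.

    A short round trip leaves less room for asymmetric delay, so its offset
    estimate is assumed to be the most trustworthy.
    """
    if not samples:
        raise ValueError("need at least one ping sample")
    best = min(samples, key=lambda s: s.round_trip)
    return best.offset
```

A client would add the returned offset to its local clock readings to express them in the leader's time base. Filtering on minimum round trip is one common heuristic; averaging or median filtering over many exchanges are alternatives.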
    Deep learning techniques have enabled rapid progress in monocular depth estimation, but their quality is limited by the ill-posed nature of the problem and the scarcity of high quality datasets. We estimate depth from a single camera by leveraging the dual-pixel auto-focus hardware that is increasingly common on modern camera sensors. Classic stereo algorithms and prior learning-based depth estimation techniques underperform when applied on this dual-pixel data, the former due to too-strong assumptions about RGB image matching, and the latter due to a lack of understanding of the optics of dual-pixel image formation. To allow learning based methods to work well on dual-pixel imagery, we identify an inherent ambiguity in the depth estimated from dual-pixel cues, and develop an approach to estimate depth up to this ambiguity. Using our approach, existing monocular depth estimation techniques can be effectively applied to dual-pixel data, and much smaller models can be constructed that still infer high quality depth. To demonstrate this, we capture a large dataset of in-the-wild 5-viewpoint RGB images paired with corresponding dual-pixel data, and show how view supervision with this data can be used to learn depth up to the unknown ambiguities. On our new task, our model is 30% more accurate than any prior work on learning-based monocular or stereoscopic depth estimation.
    Synthetic Depth-of-Field with a Single-Camera Mobile Phone
    Neal Wadhwa
    David E. Jacobs
    Bryan E. Feldman
    Nori Kanazawa
    Robert Carroll
    Marc Levoy
    SIGGRAPH (2018) (to appear)
    Shallow depth-of-field is commonly used by photographers to isolate a subject from a distracting background. However, standard cell phone cameras cannot produce such images optically, as their short focal lengths and small apertures capture nearly all-in-focus images. We present a system to computationally synthesize shallow depth-of-field images with a single mobile camera and a single button press. If the image is of a person, we use a person segmentation network to separate the person and their accessories from the background. If available, we also use dense dual-pixel auto-focus hardware, effectively a 2-sample light field with an approximately 1 millimeter baseline, to compute a dense depth map. These two signals are combined and used to render a defocused image. Our system can process a 5.4 megapixel image in 4 seconds on a mobile phone, is fully automatic, and is robust enough to be used by non-experts. The modular nature of our system allows it to degrade naturally in the absence of a dual-pixel sensor or a human subject.
    Aperture Supervision for Monocular Depth Estimation
    Pratul Srinivasan
    Neal Wadhwa
    Ren Ng
    CVPR (2018) (to appear)
    We present a novel method to train machine learning algorithms to estimate scene depths from a single image, by using the information provided by a camera's aperture as supervision. Prior works use a depth sensor's outputs or images of the same scene from alternate viewpoints as supervision, while our method instead uses images from the same viewpoint taken with a varying camera aperture. To enable learning algorithms to use aperture effects as supervision, we introduce two differentiable aperture rendering functions that use the input image and predicted depths to simulate the depth-of-field effects caused by real camera apertures. We train a monocular depth estimation network end-to-end to predict the scene depths that best explain these finite aperture images as defocus-blurred renderings of the input all-in-focus image.
    Exploring Photobios
    Ira Kemelmacher-Shlizerman
    Eli Shechtman
    Steven Seitz
    ACM Trans. on Graphics (Proc. SIGGRAPH), vol. 30(4) (2011) (to appear)
    We present an approach for generating face animations from large image collections of the same person. Such collections, which we call photobios, sample the appearance of a person over changes in pose, facial expression, hairstyle, age, and other variations. By optimizing the order in which images are displayed and crossdissolving between them, we control the motion through face space and create compelling animations (e.g., render a smooth transition from frowning to smiling). Used in this context, the cross dissolve produces a very strong motion effect; a key contribution of the paper is to explain this effect and analyze its operating range. The approach operates by creating a graph with faces as nodes, and similarities as edges, and solving for walks and shortest paths on this graph. The processing pipeline involves face detection, locating fiducials (eyes/nose/mouth), solving for pose, warping to frontal views, and image comparison based on Local Binary Patterns. We demonstrate results on a variety of datasets including time-lapse photography, personal photo collections, and images of celebrities downloaded from the Internet. Our approach is the basis for the Face Movies feature in Google’s Picasa.
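The graph formulation in the abstract above (faces as nodes, similarities as edges, shortest paths between faces) can be illustrated with a standard Dijkstra search. This is a minimal sketch under stated assumptions, not the paper's code: the `shortest_face_path` name, the adjacency-dictionary layout, and the use of a non-negative dissimilarity cost as the edge weight are hypothetical choices for illustration.

```python
import heapq
from typing import Dict, List, Optional


def shortest_face_path(
    graph: Dict[str, Dict[str, float]], source: str, target: str
) -> Optional[List[str]]:
    """Dijkstra's algorithm over a face graph.

    graph[a][b] is an assumed non-negative dissimilarity cost between faces
    a and b, so a cheap path passes through visually similar faces.
    Returns the node sequence from source to target, or None if unreachable.
    """
    dist = {source: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale heap entry for an already finalized node
        visited.add(node)
        if node == target:
            break
        for neighbor, cost in graph.get(node, {}).items():
            candidate = d + cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return None
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path
```

For example, with edges frown to neutral and neutral to smile each costing 1 and a direct frown to smile edge costing 5, the search returns the three-face path through the neutral image, which corresponds to a smoother transition than jumping directly.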
    Where's Waldo: Matching People in Images of Crowds
    Deva Ramanan
    Steven M. Seitz
    Noah Snavely
    Proc. IEEE Conf. on Computer Vision and Pattern Recognition (2011), pp. 1793-1800
    Given a community-contributed set of photos of a crowded public event, this paper addresses the problem of finding all images of each person in the scene. This problem is very challenging due to large changes in camera viewpoints, severe occlusions, low resolution and photos from tens or hundreds of different photographers. Despite these challenges, the problem is made tractable by exploiting a variety of visual and contextual cues – appearance, timestamps, camera pose and co-occurrence of people. This paper demonstrates an approach that integrates these cues to enable high quality person matching in community photo collections downloaded from Flickr.com.