Abhishek Kar
I am currently a Research Scientist on the Augmented Reality team at Google, where I work on problems at the intersection of 3D computer vision and machine learning. Prior to Google, I was the Machine Learning Lead at Fyusion Inc., a 3D computational photography startup based in San Francisco. I graduated from UC Berkeley in 2017 from Jitendra Malik's group, where I worked on machine learning and 3D computer vision. I have also spent time at Microsoft Research working on viewing large imagery on mobile devices, and with the awesome team at Fyusion capturing "3D photos" with mobile devices and developing deep learning models for them. Some features I have shipped or worked on at Fyusion include 3D visual search, creation of user-generated AR/VR content, real-time style transfer on mobile devices, and automatic damage analysis for cars.
Authored Publications
    LU-NeRF: Scene and Pose Estimation by Synchronizing Local Unposed NeRFs
    Zezhou Cheng
    Varun Jampani
    Subhransu Maji
    International Conference on Computer Vision (ICCV) (2023)
    A critical obstacle preventing NeRF models from being deployed broadly in the wild is their reliance on accurate camera poses. Consequently, there is growing interest in extending NeRF models to jointly optimize camera poses and scene representation, which offers an alternative to off-the-shelf SfM pipelines, which have well-understood failure modes. Existing approaches for unposed NeRF operate under limiting assumptions, such as a prior pose distribution or coarse pose initialization, making them less effective in a general setting. In this work, we propose a novel approach, LU-NeRF, that jointly estimates camera poses and neural radiance fields with relaxed assumptions on pose configuration. Our approach operates in a local-to-global manner, where we first optimize over local subsets of the data, dubbed “mini-scenes.” LU-NeRF estimates local pose and geometry for this challenging few-shot task. The mini-scene poses are brought into a global reference frame through a robust pose synchronization step, where a final global optimization of pose and scene can be performed. We show our LU-NeRF pipeline outperforms prior attempts at unposed NeRF without making restrictive assumptions on the pose prior. This allows us to operate in the general SE(3) pose setting, unlike the baselines. Our results also indicate our model can be complementary to feature-based SfM pipelines as it compares favorably to COLMAP on low-texture and low-resolution images.
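The pose synchronization step described in the abstract can be illustrated with a generic rotation-averaging sketch: given estimated relative rotations between overlapping mini-scenes, a spectral relaxation recovers a consistent global rotation per mini-scene, up to a shared gauge. This is not the paper's implementation; the input format and function below are illustrative assumptions.

```python
# Illustrative sketch of rotation synchronization (not the paper's code):
# given relative rotations R_ij between overlapping "mini-scenes", recover a
# consistent global rotation per mini-scene via a standard spectral relaxation.
import numpy as np

def synchronize_rotations(n, relative_rotations):
    """relative_rotations: dict mapping (i, j) -> 3x3 rotation R_ij with x_i ~ R_ij @ x_j."""
    # Block matrix whose (i, j) block holds R_ij (identity on the diagonal).
    M = np.zeros((3 * n, 3 * n))
    for i in range(n):
        M[3*i:3*i+3, 3*i:3*i+3] = np.eye(3)
    for (i, j), R_ij in relative_rotations.items():
        M[3*i:3*i+3, 3*j:3*j+3] = R_ij
        M[3*j:3*j+3, 3*i:3*i+3] = R_ij.T
    # The top-3 eigenvectors of the symmetric block matrix approximate the
    # stacked global rotations (up to a shared gauge; sign ambiguities are
    # ignored in this sketch).
    eigvals, eigvecs = np.linalg.eigh(M)
    V = eigvecs[:, -3:]                      # shape (3n, 3)
    rotations = []
    for i in range(n):
        # Project each 3x3 block back onto SO(3) with an SVD.
        U, _, Vt = np.linalg.svd(V[3*i:3*i+3])
        R = U @ Vt
        if np.linalg.det(R) < 0:             # enforce a proper rotation
            R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
        rotations.append(R)
    return rotations
```

Translations can be synchronized analogously once rotations are fixed, which mirrors the local-to-global structure the abstract describes.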
    Rapsai: Accelerating Machine Learning Prototyping of Multimedia Applications through Visual Programming
    Na Li
    Jing Jin
    Michelle Carney
    Scott Joseph Miles
    Maria Kleiner
    Xiuxiu Yuan
    Anuva Kulkarni
    Xingyu “Bruce” Liu
    Ahmed K Sabie
    Ping Yu
    Ram Iyengar
    Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI), ACM
    In recent years, there has been a proliferation of multimedia applications that leverage machine learning (ML) for interactive experiences. Prototyping ML-based applications is, however, still challenging, given complex workflows that are not ideal for design and experimentation. To better understand these challenges, we conducted a formative study with seven ML practitioners to gather insights about common ML evaluation workflows. This study helped us derive six design goals, which informed Rapsai, a visual programming platform for rapid and iterative development of end-to-end ML-based multimedia applications. Rapsai is based on a node-graph editor to facilitate interactive characterization and visualization of ML model performance. Rapsai streamlines end-to-end prototyping with interactive data augmentation and model comparison capabilities in its no-coding environment. Our evaluation of Rapsai in four real-world case studies (N=15) suggests that practitioners can accelerate their workflow, make more informed decisions, analyze strengths and weaknesses, and holistically evaluate model behavior with real-world input.
    ASIC: Aligning Sparse in-the-wild Image Collections
    Kamal Gupta
    Varun Jampani
    Abhinav Shrivastava
    International Conference on Computer Vision (ICCV) (2023)
    We present a method for joint alignment of sparse in-the-wild image collections of an object category. Most prior works assume either ground-truth keypoint annotations or a large dataset of images of a single object category. However, neither of the above assumptions hold true for the long tail of the objects present in the world. We present a self-supervised technique that directly optimizes on a sparse collection of images of a particular object/object category to obtain consistent dense correspondences across the collection. We use pairwise nearest neighbors obtained from deep features of a pre-trained vision transformer (ViT) model as noisy and sparse keypoint matches and make them dense and accurate matches by optimizing a neural network that jointly maps the image collection into a learned canonical grid. Experiments on CUB and SPair-71k benchmarks demonstrate that our method can produce globally consistent and higher quality correspondences across the image collection when compared to existing self-supervised methods. Code and other material will be made available at https://kampta.github.io/asic.
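As a rough illustration of the noisy keypoint-matching step described above, mutual nearest neighbors between per-patch ViT features of two images yield sparse candidate correspondences; the dense canonical-grid optimization is the paper's contribution and is not reproduced here. The feature extractor is left abstract, and array shapes and names below are assumptions.

```python
# Illustrative sketch (not the authors' implementation): mutual nearest-neighbor
# matching between per-patch deep features of two images, as used to seed
# sparse, noisy keypoint matches.
import numpy as np

def mutual_nearest_neighbors(feats_a, feats_b):
    """feats_a: (Na, D), feats_b: (Nb, D) L2-normalized patch features.
    Returns index pairs (i, j) that are each other's nearest neighbor."""
    sim = feats_a @ feats_b.T                 # cosine similarity, shape (Na, Nb)
    nn_ab = sim.argmax(axis=1)                # best match in B for each patch in A
    nn_ba = sim.argmax(axis=0)                # best match in A for each patch in B
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]
```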
    SLIDE: Single Image 3D Photography with Soft Layering and Depth-aware Inpainting
    International Conference on Computer Vision (ICCV) (2021)
    Single image 3D photography enables viewers to view a still image from novel viewpoints. Recent approaches for single-image view synthesis combine a monocular depth network with inpainting networks, resulting in compelling novel view synthesis results. A drawback of these approaches is their use of hard layering, which makes them unsuitable for modeling intricate appearance effects such as matting. We present SLIDE, a modular and unified system for single image 3D photography that uses a simple yet effective soft layering strategy to model such appearance effects. In addition, we propose a novel depth-aware training of the inpainting network suitable for the 3D photography task. Extensive experimental analysis on three different view synthesis datasets, in combination with user studies on in-the-wild image collections, demonstrates the superior performance of our technique in comparison to existing strong baselines.
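The soft layering mentioned in the abstract can be illustrated with ordinary premultiplied alpha compositing: instead of a hard depth threshold, each pixel receives a soft foreground weight derived from monocular depth, which lets thin structures and matting-like effects blend across layers. The sigmoid weighting below is an illustrative stand-in, not the paper's exact formulation.

```python
# Illustrative sketch of soft layering for single-image 3D photography
# (not the paper's formulation): split an RGB image into foreground and
# background layers with a soft alpha derived from depth. A hard layering
# would threshold the depth map instead.
import numpy as np

def soft_split(image, depth, split_depth, softness=0.05):
    """image: (H, W, 3) float RGB, depth: (H, W) relative or metric depth.
    Returns premultiplied foreground/background layers and the soft alpha."""
    # Soft foreground weight: ~1 in front of the split plane, ~0 behind it.
    alpha = 1.0 / (1.0 + np.exp((depth - split_depth) / softness))
    fg = image * alpha[..., None]            # premultiplied foreground layer
    bg = image * (1.0 - alpha[..., None])    # premultiplied background layer
    return fg, bg, alpha

# Because the layers are premultiplied, compositing them back (possibly after
# warping each layer to a novel viewpoint) is a simple sum: fg + bg == image.
```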
    Free-Viewpoint Facial Re-Enactment from a Casual Capture
    Srinivas Rao
    Rodrigo Ortiz-Cayon
    Matteo Munaro
    Aidas Liaudanskas
    Krunal Chande
    Tobias Bertel
    Christian Richardt
    Alexander JB Trevor
    Stefan Holzer
    SIGGRAPH Asia 2020 Posters, Association for Computing Machinery, Virtual Event, Republic of Korea
    We propose a system for free-viewpoint facial re-enactment from a casual video capture of a target subject. Our system can render and re-enact the subject consistently in all the captured views. Furthermore, our system also enables interactive free-viewpoint facial re-enactment of the target from novel views. The re-enactment of the target subject is driven by an expression sequence of a source subject, which is captured using a custom app running on an iPhone X. Our system handles large pose variations in the target subject while keeping the re-enactment consistent. We demonstrate the efficacy of our system by showing various applications.
    Learning Independent Object Motion from Unlabelled Stereoscopic Videos
    Zhe Cao
    Christian Häne
    Jitendra Malik
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
    We present a system for learning motion maps of independently moving objects from stereo videos. The only annotations used in our system are 2D object bounding boxes, which introduce the notion of objects in our system. Unlike prior learning-based approaches, which have focused on predicting dense optical flow fields and/or depth maps for images, we propose to predict instance-specific 3D scene flow maps and instance masks from which we derive a factored 3D motion map for each object instance. Our network takes the 3D geometry of the problem into account, which allows it to correlate the input images and distinguish moving objects from static ones. We present experiments evaluating the accuracy of our 3D flow vectors, as well as depth maps and projected 2D optical flow, where our jointly learned system outperforms earlier approaches trained for each task independently.
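A minimal sketch of the factored motion idea described above: given a dense 3D scene-flow map and an instance mask (both assumed inputs here), a single rigid motion per object can be recovered with an orthogonal Procrustes fit. This is illustrative only and not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): collapse a dense 3D scene-flow
# map into a single rigid motion per object instance by fitting a rotation and
# translation to the flow of the points inside the instance mask.
import numpy as np

def instance_rigid_motion(points, scene_flow, mask):
    """points: (H, W, 3) 3D points, scene_flow: (H, W, 3) per-point 3D flow,
    mask: (H, W) boolean instance mask. Returns (R, t) with q ~ R @ p + t."""
    p = points[mask]                      # (N, 3) object points at frame t
    q = p + scene_flow[mask]              # corresponding points at frame t+1
    p_c, q_c = p - p.mean(0), q - q.mean(0)
    # Orthogonal Procrustes (Kabsch) fit of the rotation.
    U, _, Vt = np.linalg.svd(p_c.T @ q_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q.mean(0) - R @ p.mean(0)
    return R, t
```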