Ameesh Makadia

I am currently a Research Scientist with Google in NYC. My research interests lie at the intersection of Computer Vision and Machine Learning. For more details and a complete publication list please visit my personal site.
Authored Publications
    Scaling Spherical CNNs
    Jean-Jacques Slotine
    International Conference on Machine Learning (ICML) (2023)
    Abstract: Spherical CNNs generalize CNNs to functions on the sphere, by using spherical convolutions as the main linear operation. The most accurate and efficient way to compute spherical convolutions is in the spectral domain (via the convolution theorem), but this is still much more costly than the usual planar convolutions. For this reason, applications of spherical CNNs have so far been limited to small problems that can be approached with low model capacity. In this work, we show how spherical CNNs can be scaled for much larger problems. To achieve this, we made critical improvements, including implementing core operations to exploit hardware accelerator characteristics, introducing novel variants of common model components, and showing how to construct application-specific input representations that exploit the properties of our model. Experiments show our larger spherical CNNs reach state-of-the-art on several targets of the QM9 molecular benchmark, which was previously dominated by equivariant graph neural networks, and achieve competitive performance on multiple weather forecasting tasks.
    NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations
    Varun Jampani
    Andreas Engelhardt
    Arjun Karpur
    Karen Truong
    Kyle Sargent
    Ricardo Martin-Brualla
    Kaushal Patel
    Daniel Vlasic
    Vittorio Ferrari
    Ce Liu
    Neural Information Processing Systems (NeurIPS) (2023)
    Abstract: Recent advances in neural reconstruction enable high-quality 3D object reconstruction from casually captured image collections. Current techniques mostly analyze their progress on relatively simple image collections where SfM techniques can provide ground-truth (GT) camera poses. We note that SfM techniques tend to fail on in-the-wild image collections such as image search results with varying backgrounds and illuminations. To enable systematic research progress on 3D reconstruction from casual image captures, we propose a new dataset of image collections called NAVI, consisting of category-agnostic image collections of objects with high-quality 3D scans along with per-image 2D-3D alignments providing near-perfect GT camera parameters. These 2D-3D alignments allow us to extract derivative annotations such as dense pixel correspondences, depth and segmentation maps. We demonstrate the use of NAVI image collections on different problem settings and show that NAVI enables more thorough evaluations that were not possible with existing datasets. We believe NAVI is beneficial for systematic research progress on 3D reconstruction and correspondence estimation. Project page: https://navidataset.github.io
    Learning to Transform for Generalizable Instance-wise Invariance
    Utkarsh Singhal
    Stella Yu
    International Conference on Computer Vision (ICCV) (2023)
    Abstract: Computer vision research has long aimed to build systems that are robust to spatial transformations found in natural data. Traditionally, this is done using data augmentation or hard-coding invariances into the architecture. However, too much or too little invariance can hurt, and the correct amount is unknown a priori and dependent on the instance. Ideally, the appropriate invariance would be learned from data and inferred at test-time. We treat invariance as a prediction problem. Given any image, we use a normalizing flow to predict a distribution over transformations and average the predictions over them. Since this distribution only depends on the instance, we can align instances before classifying them and generalize invariance across classes. The same distribution can also be used to adapt to out-of-distribution poses. This normalizing flow is trained end-to-end and can learn a much larger range of transformations than Augerino and InstaAug. When used as data augmentation, our method shows accuracy and robustness gains on CIFAR10, CIFAR10-LT, and TinyImageNet.
    ASIC: Aligning Sparse in-the-wild Image Collections
    Kamal Gupta
    Varun Jampani
    Abhinav Shrivastava
    International Conference on Computer Vision (ICCV) (2023)
    Abstract: We present a method for joint alignment of sparse in-the-wild image collections of an object category. Most prior works assume either ground-truth keypoint annotations or a large dataset of images of a single object category. However, neither of the above assumptions holds true for the long tail of the objects present in the world. We present a self-supervised technique that directly optimizes on a sparse collection of images of a particular object/object category to obtain consistent dense correspondences across the collection. We use pairwise nearest neighbors obtained from deep features of a pre-trained vision transformer (ViT) model as noisy and sparse keypoint matches and make them dense and accurate matches by optimizing a neural network that jointly maps the image collection into a learned canonical grid. Experiments on CUB and SPair-71k benchmarks demonstrate that our method can produce globally consistent and higher quality correspondences across the image collection when compared to existing self-supervised methods. Code and other material will be made available at https://kampta.github.io/asic.
    LU-NeRF: Scene and Pose Estimation by Synchronizing Local Unposed NeRFs
    Zezhou Cheng
    Varun Jampani
    Subhransu Maji
    International Conference on Computer Vision (ICCV) (2023)
    Abstract: A critical obstacle preventing NeRF models from being deployed broadly in the wild is their reliance on accurate camera poses. Consequently, there is growing interest in extending NeRF models to jointly optimize camera poses and scene representation, which offers an alternative to off-the-shelf SfM pipelines that have well-understood failure modes. Existing approaches for unposed NeRF operate under limiting assumptions, such as a prior pose distribution or coarse pose initialization, making them less effective in a general setting. In this work, we propose a novel approach, LU-NeRF, that jointly estimates camera poses and neural radiance fields with relaxed assumptions on pose configuration. Our approach operates in a local-to-global manner, where we first optimize over local subsets of the data, dubbed “mini-scenes.” LU-NeRF estimates local pose and geometry for this challenging few-shot task. The mini-scene poses are brought into a global reference frame through a robust pose synchronization step, where a final global optimization of pose and scene can be performed. We show our LU-NeRF pipeline outperforms prior attempts at unposed NeRF without making restrictive assumptions on the pose prior. This allows us to operate in the general SE(3) pose setting, unlike the baselines. Our results also indicate our model can be complementary to feature-based SfM pipelines as it compares favorably to COLMAP on low-texture and low-resolution images.
    Generalizable Patch-Based Neural Rendering
    Leonid Sigal
    European Conference on Computer Vision (2022) (to appear)
    Abstract: Neural rendering has received tremendous attention since the advent of Neural Radiance Fields (NeRF), and has pushed the state-of-the-art on novel-view synthesis considerably. The recent focus has been on models that overfit to a single scene, and the few attempts to learn models that can synthesize novel views of unseen scenes mostly consist of combining deep convolutional features with a NeRF-like model. We propose a different paradigm, where no deep visual features and no NeRF-like volume rendering are needed. Our method is capable of predicting the color of a target ray in a novel scene directly, just from a collection of patches sampled from the scene. We first leverage epipolar geometry to extract patches along the epipolar lines of each reference view. Each patch is linearly projected into a 1D feature vector and a sequence of transformers processes the collection. For positional encoding, we parameterize rays as in a light field representation, with the crucial difference that the coordinates are canonicalized with respect to the target ray, which makes our method independent of the reference frame and improves generalization. We show that our approach outperforms the state-of-the-art on novel view synthesis of unseen scenes even when being trained with considerably less data than prior work. Our code is available at https://mohammedsuhail.net/gen_patch_neural_rendering.
    Learning ABCs: Approximate Bijective Correspondence for isolating factors of variation with weak supervision
    Kieran Alexander Murphy
    Varun Jampani
    Computer Vision and Pattern Recognition (CVPR) (2022) (to appear)
    Abstract: Representational learning forms the backbone of most deep learning applications, and the value of a learned representation depends on its information content about the different factors of variation. Learning good representations is intimately tied to the nature of supervision and the learning algorithm. We propose a novel algorithm that relies on a weak form of supervision where the data is partitioned into sets according to certain inactive factors of variation. Our key insight is that by seeking approximate correspondence between elements of different sets, we learn strong representations that exclude the inactive factors of variation and isolate the active factors which vary within all sets. We demonstrate that the method can work in a semi-supervised scenario, and that a portion of the unsupervised data can belong to a different domain entirely, as long as the same active factors of variation are present. By folding in data augmentation to suppress additional nuisance factors, we are able to further control the content of the learned representations. We outperform competing baselines on the challenging problem of synthetic-to-real object pose transfer.
    Light Field Neural Rendering
    Leonid Sigal
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
    Abstract: Classical light field rendering for novel view synthesis can accurately reproduce view-dependent effects such as reflection, refraction, and translucency, but requires a dense view sampling of the scene. Methods based on geometric reconstruction need only sparse views, but cannot accurately model non-Lambertian effects. We introduce a model that combines the strengths and mitigates the limitations of these two directions. By operating on a four-dimensional representation of the light field, our model learns to represent view-dependent effects accurately. By enforcing geometric constraints during training and inference, the scene geometry is implicitly learned from a sparse set of views. Concretely, we introduce a two-stage transformer-based model that first aggregates features along epipolar lines, then aggregates features along reference views to produce the color of a target ray. Our model outperforms the state-of-the-art on multiple forward-facing and 360° datasets, with larger margins on scenes with severe view-dependent variations. Code and results can be found at light-field-neural-rendering.github.io.
    Abstract: We introduce the problem of perpetual view generation: long-range generation of novel views corresponding to an arbitrarily long camera trajectory given a single image. This is a challenging problem that goes far beyond the capabilities of current view synthesis methods, which work for a limited range of viewpoints and quickly degenerate when presented with a large camera motion. Methods designed for video generation also have limited ability to produce long video sequences and are often agnostic to scene geometry. We take a hybrid approach that integrates both geometry and image synthesis in an iterative render, refine, and repeat framework, allowing for long-range generation that covers large distances after hundreds of frames. Our approach can be trained from a set of monocular video sequences without any manual annotation. We propose a dataset of aerial footage of natural coastal scenes, and compare our method with recent view synthesis and conditional video generation baselines, showing that it can generate plausible scenes for much longer time horizons over large camera trajectories compared to existing methods.
    Abstract: Single image pose estimation is a fundamental problem in many vision and robotics tasks, and existing deep learning approaches suffer by not completely modeling and handling: i) uncertainty about the predictions, and ii) symmetric objects with multiple (sometimes infinite) correct poses. To this end, we introduce a method to estimate arbitrary, non-parametric distributions on SO(3). Our key idea is to represent the distributions implicitly, with a neural network that estimates the probability given the input image and a candidate pose. Grid sampling or gradient ascent can be used to find the most likely pose, but it is also possible to evaluate the probability at any pose, enabling reasoning about symmetries and uncertainty. This is the most general way of representing distributions on manifolds, and to showcase the rich expressive power, we introduce a dataset of challenging symmetric and nearly-symmetric objects. We require no supervision on pose uncertainty: the model trains only with a single pose per example. Nonetheless, our implicit model is expressive enough to handle complex distributions over 3D poses, while still obtaining accurate pose estimation on standard non-ambiguous environments, achieving state-of-the-art performance on Pascal3D+ and ModelNet10-SO(3) benchmarks. Code, data, and visualizations may be found at implicit-pdf.github.io.
    KeypointDeformer: Unsupervised 3D Keypoint Discovery for Shape Control
    Tomas Jakab
    Jiajun Wu
    Angjoo Kanazawa
    Computer Vision and Pattern Recognition (CVPR) (2021)
    Abstract: We present KeypointDeformer, a novel unsupervised method for shape control through automatically discovered 3D keypoints. Our approach produces intuitive and semantically consistent control of shape deformations. Moreover, our discovered 3D keypoints are consistent across object category instances despite large shape variations. Since our method is unsupervised, it can be readily deployed to new object categories without requiring expensive annotations for 3D keypoints and deformations.
    De-rendering the World’s Revolutionary Artefacts
    Elliott Wu
    Jiajun Wu
    Angjoo Kanazawa
    Computer Vision and Pattern Recognition (CVPR) (2021)
    Abstract: Recent works have shown exciting results in unsupervised image de-rendering: learning to decompose 3D shape, appearance, and lighting from single-image collections without explicit supervision. However, many of these assume simplistic material and lighting models. We propose a method, termed RADAR (Revolutionary Artefact De-rendering And Re-rendering), that can recover environment illumination and surface materials from real single-image collections, relying neither on explicit 3D supervision, nor on multi-view or multi-light images. Specifically, we focus on rotationally symmetric artefacts that exhibit challenging surface properties including specular reflections, such as vases. We introduce a novel self-supervised albedo discriminator, which allows the model to recover plausible albedo without requiring any ground-truth during training. In conjunction with a shape reconstruction module exploiting rotational symmetry, we present an end-to-end learning framework that is able to de-render the world's revolutionary artefacts. We conduct experiments on a real vase dataset and demonstrate compelling decomposition results, allowing for applications including free-viewpoint rendering and relighting.
    An Analysis of SVD for Deep Rotation Estimation
    Jake Levinson
    Arthur Chen
    Angjoo Kanazawa
    Advances in Neural Information Processing Systems (NeurIPS) 2020
    Abstract: Symmetric orthogonalization via SVD, and closely related procedures, are well-known techniques for projecting matrices onto O(n) or SO(n). These tools have long been used for applications in computer vision, for example optimal 3D alignment problems solved by orthogonal Procrustes, rotation averaging, or Essential matrix decomposition. Despite its utility in different settings, SVD orthogonalization as a procedure for producing rotation matrices is typically overlooked in deep learning models, where the preferences tend toward classic representations like unit quaternions, Euler angles, and axis-angle, or more recently-introduced methods. Despite the importance of 3D rotations in computer vision and robotics, a single universally effective representation is still missing. Here, we explore the viability of SVD orthogonalization for 3D rotations in neural networks. We present a theoretical analysis of SVD as used for projection onto the rotation group. Our extensive quantitative analysis shows that simply replacing existing representations with the SVD orthogonalization procedure obtains state-of-the-art performance in many deep learning applications covering both supervised and unsupervised training.
    Spin-Weighted Spherical CNNs
    Kostas Daniilidis
    Advances in Neural Information Processing Systems (NeurIPS) 2020
    Abstract: Learning equivariant representations is a promising way to reduce sample and model complexity and improve the generalization performance of deep neural networks. The spherical CNNs are successful examples, producing SO(3)-equivariant representations of spherical inputs. There are two main types of spherical CNNs. The first type lifts the inputs to functions on the rotation group SO(3) and applies convolutions on the group, which are computationally expensive since SO(3) has one extra dimension. The second type applies convolutions directly on the sphere, which are limited to zonal (isotropic) filters, and thus have limited expressivity. In this paper, we present a new type of spherical CNN that allows anisotropic filters in an efficient way, without ever leaving the spherical domain. The key idea is to consider spin-weighted spherical functions, which were introduced in physics in the study of gravitational waves. These are complex-valued functions on the sphere whose phases change upon rotation. We define a convolution between spin-weighted functions and build a CNN based on it. The spin-weighted functions can also be interpreted as spherical vector fields, allowing applications to tasks where the inputs or outputs are vector fields. Experiments show that our method outperforms previous methods on tasks like classification of spherical images, classification of 3D shapes and semantic segmentation of spherical panoramas.
    Cross-Domain 3D Equivariant Image Embeddings
    Zhengyi Luo
    Kostas Daniilidis
    Proceedings of the 36th International Conference on Machine Learning, 2019, PMLR
    Abstract: Spherical convolutional neural networks have been introduced recently as a tool to learn powerful feature representations of 3D shapes. Since spherical convolutions are equivariant to 3D rotations, the latent space of a Spherical CNN provides a natural representation for applications where 3D data may be observed in arbitrary orientations. In this paper we explore whether it is possible to learn 2D image embeddings with a similar equivariant structure: embedding the image of a 3D object should commute with rotations of the object. Our proposal is to bootstrap our model with supervision from a Spherical CNN pretrained with 3D shapes. Given an equivariant latent representation for 3D shapes, we introduce a novel supervised cross-domain embedding architecture that learns to map 2D images into the Spherical CNN's latent space. Our model is only optimized to produce the embeddings from an image's corresponding 3D shape. The trained model learns to encode images with 3D shape properties and is equivariant to 3D rotations of the observed object. We show that learning only a rich embedding for images with appropriate geometric structure is in and of itself sufficient for tackling numerous applications. We show evidence from two different applications, relative pose estimation and novel view synthesis. In both settings we demonstrate that equivariant embeddings are sufficient for the application without requiring any task-specific supervised training.
    Labeling Panoramas with Spherical Hourglass Networks
    Kostas Daniilidis
    360-degree Perception and Interaction Workshop at ECCV18 (2018)
    Learning SO(3) Equivariant Representations with Spherical CNNs
    Carlos Esteves
    Christine Allen-Blanchette
    Kostas Daniilidis
    European Conference on Computer Vision (ECCV) (2018)
    Deformable Shape Completion with Graph Convolutional Autoencoders
    Or Litany
    Alex Bronstein
    Michael Bronstein
    CVPR 2018 (to appear)
    Abstract: The availability of affordable and portable depth sensors has made scanning objects and people simpler than ever. However, dealing with occlusions and missing parts is still a significant challenge. The problem of reconstructing a (possibly non-rigidly moving) 3D object from a single or multiple partial scans has received increasing attention in recent years. In this work, we propose a novel learning-based method for the completion of partial shapes. Unlike the majority of existing approaches, our method focuses on objects that can undergo non-rigid deformations. The core of our method is a variational autoencoder with graph convolutional operations that learns a latent space for complete realistic shapes. At inference, we optimize to find the representation in this latent space that best fits the generated shape to the known partial input. The completed shape exhibits a realistic appearance on the unknown part. We show promising results towards the completion of synthetic and real scans of human body and face meshes exhibiting different styles of articulation and partiality.
    Geometry of 3D Environments and Sum of Squares Polynomials
    Amir Ali Ahmadi
    Georgina Hall
    Robotics: Science and Systems (2017)
    Abstract: Motivated by applications in robotics and computer vision, we study problems related to spatial reasoning of a 3D environment using sublevel sets of polynomials. These include: tightly containing a cloud of points (e.g., representing an obstacle) with convex or nearly-convex basic semialgebraic sets, computation of Euclidean distances between two such sets, separation of two convex basic semialgebraic sets that overlap, and tight containment of the union of several basic semialgebraic sets with a single convex one. We use algebraic techniques from sum of squares optimization that reduce all these tasks to semidefinite programs of small size and present numerical experiments in realistic scenarios.
    Learning 3D Part Detection from Sparsely Labeled Data
    Mehmet Ersin Yumer
    2nd International Conference on 3D Vision (2014)
    Co-Segmentation of Textured 3D Shapes with Sparse Annotations
    M. Ersin Yumer
    Computer Vision and Pattern Recognition (CVPR) (2014)
    Label Partitioning for Sublinear Ranking
    Jason Weston
    International Conference on Machine Learning (2013)
    Baselines for Image Annotation
    Vladimir Pavlovic
    International Journal of Computer Vision (IJCV) (2010)
    Shape-based Object Recognition in Videos Using 3D Synthetic Object Models
    Alexander Toshev
    Kostas Daniilidis
    Computer Vision and Pattern Recognition (2009)
    A New Baseline For Image Annotation
    Vladimir Pavlovic
    European Conference on Computer Vision (ECCV) (2008)