Avneesh Sud
Authored Publications
In the era of deep learning, human pose estimation from multiple cameras with unknown calibration has received little attention to date. We show how to train a neural model to perform this task with high precision and minimal latency overhead. The proposed model takes into account joint location uncertainty due to occlusion from multiple views, and requires only 2D keypoint data for training. Our method outperforms both classical bundle adjustment and weakly-supervised monocular 3D baselines on the well-established Human3.6M dataset, as well as the more challenging in-the-wild Ski-Pose PTZ dataset.
Distribution alignment has many applications in deep learning, including domain adaptation and unsupervised image-to-image translation. Most prior work on unsupervised distribution alignment relies either on minimizing simple non-parametric statistical distances such as maximum mean discrepancy or on adversarial alignment. However, the former fails to capture the structure of complex real-world distributions, while the latter is difficult to train and does not provide any universal convergence guarantees or automatic quantitative validation procedures. In this paper, we propose a new distribution alignment method based on a log-likelihood ratio statistic and normalizing flows. We show that, under certain assumptions, this combination yields a deep neural likelihood-based minimization objective that attains a known lower bound upon convergence. We experimentally verify that minimizing the resulting objective results in domain alignment that preserves the local structure of input domains.
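For readers unfamiliar with the non-parametric distances mentioned in this abstract, the following is a minimal sketch of a squared maximum mean discrepancy (MMD) estimate with a Gaussian kernel. It illustrates the prior-work baseline only, not the likelihood-ratio and normalizing-flow method the paper proposes. The function names, the bandwidth value, and the toy Gaussian samples are all assumptions made for illustration.

```python
import math
import random


def rbf_kernel(u, v, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


def sample_gaussian(n, mean, rng):
    """Draw n two-dimensional points with unit variance around (mean, mean)."""
    return [(rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)) for _ in range(n)]


# Toy data (hypothetical): two samples from the same distribution, one shifted.
rng = random.Random(0)
source = sample_gaussian(100, 0.0, rng)
same = sample_gaussian(100, 0.0, rng)
shifted = sample_gaussian(100, 3.0, rng)

print("same distribution:   ", mmd_squared(source, same))
print("shifted distribution:", mmd_squared(source, shifted))
```

The estimate is close to zero for samples from the same distribution and grows as the distributions move apart. With a single fixed kernel it only compares samples through that kernel's similarity measure, which is the limitation the abstract refers to for complex real-world distributions.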
The goal of this project is to learn a 3D shape representation that enables accurate surface reconstruction, compact storage, efficient computation, consistency for similar shapes, generalization across diverse shape categories, and inference from depth camera observations. Towards this end, we introduce Local Deep Implicit Functions (LDIF), a 3D shape representation that decomposes space into a structured set of learned implicit functions. We provide networks that infer the space decomposition and local deep implicit functions from a 3D mesh or posed depth image. During experiments, we find that it provides 10.3 points higher surface reconstruction accuracy (F-Score) than the state-of-the-art (OccNet), while requiring fewer than 1 percent of the network parameters. Experiments on posed depth image completion and generalization to unseen classes show 15.8 and 17.8 point improvements over the state-of-the-art, while producing a structured 3D representation for each input with consistency across diverse shape collections.
Cross-Domain 3D Equivariant Image Embeddings
Zhengyi Luo
Kostas Daniilidis
Proceedings of the 36th International Conference on Machine Learning, 2019, PMLR
Spherical convolutional neural networks have been introduced recently as a tool to learn powerful feature representations of 3D shapes. Since spherical convolutions are equivariant to 3D rotations, the latent space of a SphericalCNN provides a natural representation for applications where 3D data may be observed in arbitrary orientations.
In this paper we explore if it is possible to learn 2D image embeddings with a similar equivariant structure: embedding the image of a 3D object should commute with rotations of the object. Our proposal is to bootstrap our model with supervision from a Spherical CNN pretrained with 3D shapes. Given an equivariant latent representation for 3D shapes, we introduce a novel supervised cross-domain embedding architecture that learns to map 2D images into the Spherical CNN's latent space. Our model is only optimized to produce the embeddings from an image's corresponding 3D shape. The trained model learns to encode images with 3D shape properties and is equivariant to 3D rotations of the observed object.
We show that learning only a rich embedding for images with appropriate geometric structure is in and of itself sufficient for tackling numerous applications. We show evidence from two different applications, relative pose estimation and novel view synthesis. In both settings we demonstrate that equivariant embeddings are sufficient for the application without requiring any task-specific supervised training.
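The equivariance property described in this abstract (embedding a rotated object gives the same result as rotating the embedding of the original object) can be checked numerically on a toy example. The sketch below is not the paper's architecture: it uses the centroid of a point set as a stand-in embedding and rotations about the z-axis only, and every name in it is hypothetical.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by the given angle in radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def centroid(points):
    """Toy embedding: the mean of the points. It commutes with rotations."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def axis_max(points):
    """Toy embedding that is NOT equivariant: the per-axis maximum."""
    return tuple(max(p[i] for p in points) for i in range(3))


def is_equivariant(embed, points, angle, tol=1e-9):
    """Check embed(rotate(points)) == rotate(embed(points)) up to tol."""
    rotated_then_embedded = embed([rotate_z(p, angle) for p in points])
    embedded_then_rotated = rotate_z(embed(points), angle)
    return all(
        abs(a - b) <= tol
        for a, b in zip(rotated_then_embedded, embedded_then_rotated)
    )


# Hypothetical point set standing in for a 3D shape.
shape = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.5), (-1.0, 1.0, 3.0)]
angle = math.pi / 3

print("centroid equivariant:", is_equivariant(centroid, shape, angle))
print("axis_max equivariant:", is_equivariant(axis_max, shape, angle))
```

In the paper's setting the embedding is a learned network and the rotation acts on a spherical feature map rather than on a single 3D vector, but the commutation check has the same form.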
Virtual Reality (VR) has advanced significantly in recent years and allows users to explore novel environments (both real and imaginary), play games, and engage with media in a way that is unprecedentedly immersive. However, compared to physical reality, sharing these experiences is difficult because the user's virtual environment is not easily observable from the outside and the user's face is partly occluded by the VR headset. Mixed Reality (MR) is a medium that alleviates some of this disconnect by sharing the virtual context of a VR user in a flat video format that can be consumed by an audience to get a feel for the user's experience.
Even though MR allows audiences to connect actions of the VR user with their virtual environment, empathizing with them is difficult because their face is hidden by the headset. We present a solution to address this problem by virtually removing the headset and revealing the face underneath it using a combination of 3D vision, machine learning and graphics techniques. We have integrated our headset removal approach with Mixed Reality, and demonstrate results on several VR games and experiences.
Eyemotion: Classifying facial expressions in VR using eye-tracking cameras
Nick Dufour
arXiv, https://arxiv.org/abs/1707.07204 (2017)
One of the main challenges of social interaction in virtual reality settings is that head-mounted displays occlude a large portion of the face, blocking facial expressions and thereby restricting social engagement cues among users. Hence, auxiliary means of sensing and conveying these expressions are needed. We present an algorithm to automatically infer expressions by analyzing only a partially occluded face while the user is engaged in a virtual reality experience. Specifically, we show that images of the user's eyes captured from an IR gaze-tracking camera within a VR headset are sufficient to infer a select subset of facial expressions without the use of any fixed external camera. Using these inferences, we can generate dynamic avatars in real-time which function as an expressive surrogate for the user. We propose a novel data collection pipeline as well as a novel approach for increasing CNN accuracy via personalization. Our results show a mean accuracy of 74% (F1 of 0.73) among 5 'emotive' expressions and a mean accuracy of 70% (F1 of 0.68) among 10 distinct facial action units, outperforming human raters.