Vivek Kwatra
Authored Publications
Synthesis-Assisted Video Prototyping From a Document
Brian R. Colonna
Christian Frueh
UIST 2022: ACM Symposium on User Interface Software and Technology (2022)
Abstract
Video productions commonly start with a script, especially for talking head videos that feature a speaker narrating to the camera. When the source material comes from a written document, such as a web tutorial, it takes iterations to refine the content from a text article into a spoken dialogue while considering the visual composition of each scene. We propose Doc2Video, a video prototyping approach that converts a document into an interactive script with a preview of synthetic talking head videos. Our pipeline decomposes a source document into a series of scenes and automatically creates a synthesized video of a virtual instructor for each scene. Designed for a specific domain, programming cookbooks, our system places visual elements from the source document, such as a keyword, a code snippet, or a screenshot, in suitable layouts. Users edit narration sentences, break or combine sections, and modify visuals to prototype a video in our Editing UI. We evaluated our pipeline with public programming cookbooks. Feedback from professional creators shows that our method provided a reasonable starting point to engage them in interactive scripting for a narrated instructional video.
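As a rough illustration of the scene-decomposition step, the Python sketch below splits a Markdown-style programming cookbook into draft scenes, pairing each chunk of prose (narration) with the code block that follows it. The function name, the two-sentence narration limit, and the assumption of fenced code blocks are hypothetical choices for this sketch, not details of the Doc2Video pipeline.

    import re

    def document_to_scenes(markdown_text, max_sentences=2):
        # Split on fenced code blocks; the capture group keeps the code so that
        # prose chunks and code blocks alternate in the resulting list.
        parts = re.split(r"```(?:\w+)?\n(.*?)```", markdown_text, flags=re.S)
        prose_chunks, code_blocks = parts[0::2], parts[1::2] + [None]
        scenes = []
        for prose, code in zip(prose_chunks, code_blocks):
            sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", prose.strip()) if s.strip()]
            if not sentences and code is None:
                continue
            scenes.append({
                "narration": " ".join(sentences[:max_sentences]),  # keep narration short and editable
                "visual": {"type": "code", "content": code} if code else {"type": "text"},
            })
        return scenes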
LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization
Avisek Lahiri
Christian Frueh
John Lewis
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2021) (to appear)
Abstract
In this paper, we present a video-based learning framework for animating personalized 3D talking faces from audio. We introduce two training-time data normalizations that significantly improve data sample efficiency. First, we isolate and represent faces in a normalized space that decouples 3D geometry, head pose, and texture. This decomposes the prediction problem into regressions over the 3D face shape and the corresponding 2D texture atlas. Second, we leverage facial symmetry and approximate albedo constancy of skin to isolate and remove spatio-temporal lighting variations. Together, these normalizations allow simple networks to generate high-fidelity lip-sync videos under novel ambient illumination while training with just a single speaker-specific video. Further, to stabilize temporal dynamics, we introduce an auto-regressive approach that conditions the model on its previous visual state. Human ratings and objective metrics demonstrate that our method outperforms contemporary state-of-the-art audio-driven video reenactment benchmarks in terms of realism, lip-sync, and visual quality scores. We illustrate several applications enabled by our framework.
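The first normalization, decoupling 3D geometry from head pose, amounts to mapping the tracked face vertices back into a canonical space by undoing the estimated rigid head motion. Below is a minimal NumPy sketch of that idea; the function name and the (R, t) convention are assumptions for illustration, not the paper's implementation.

    import numpy as np

    def normalize_pose(vertices_3d, rotation, translation):
        # Undo the estimated rigid head motion: if x_posed = R @ x_norm + t,
        # then x_norm = R^T @ (x_posed - t). With row-vector vertices (N x 3),
        # that is (X - t) @ R.
        X = np.asarray(vertices_3d, dtype=float)
        return (X - np.asarray(translation)) @ np.asarray(rotation)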
Abstract
The use of 360 degree cameras, which record and share a full-spherical 360° × 180° view without any cropping of the viewing angle, is on the rise. Shake in such videos is problematic, especially when they are viewed in VR headsets, where it can cause cybersickness. The current state-of-the-art video stabilization algorithm designed specifically for 360 degree videos [Kopf 2016] accounts for the special geometric constraints of such videos. However, specific steps in that algorithm can abruptly change the viewing direction, leading to an unnatural experience for the viewer. In this paper, we propose to fix this anomaly by imposing L1 smoothness constraints on the camera path, as suggested by Grundmann et al. [2011]. The modified algorithm is generic, and our experiments indicate that it not only gives a more natural and smoother stabilization for 360 degree videos but can also be used to stabilize normal field-of-view videos.
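The L1 camera-path idea can be sketched as a small convex optimization over a 1D path signal (e.g., yaw over time): minimize the L1 norms of the first, second, and third differences of the smoothed path while keeping it close to the original path. The weights, the proximity bound, and the use of cvxpy are assumptions for this toy, one-dimensional sketch, not the full 360 degree formulation.

    import numpy as np
    import cvxpy as cp

    def smooth_camera_path_1d(raw_path, w1=10.0, w2=1.0, w3=100.0, bound=0.05):
        # raw_path: original per-frame camera parameter (e.g., yaw in radians).
        c = np.asarray(raw_path, dtype=float)
        p = cp.Variable(len(c))
        objective = cp.Minimize(
            w1 * cp.sum(cp.abs(cp.diff(p, 1)))    # penalize velocity (favors static segments)
            + w2 * cp.sum(cp.abs(cp.diff(p, 2)))  # penalize acceleration (favors linear segments)
            + w3 * cp.sum(cp.abs(cp.diff(p, 3)))  # penalize jerk (favors parabolic segments)
        )
        constraints = [cp.abs(p - c) <= bound]    # stay within a crop-like bound of the input path
        cp.Problem(objective, constraints).solve()
        return p.value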
Abstract
Virtual Reality (VR) has advanced significantly in recent years and allows users to explore novel environments (both real and imaginary), play games, and engage with media in a way that is unprecedentedly immersive. However, compared to physical reality, sharing these experiences is difficult because the user's virtual environment is not easily observable from the outside and the user's face is partly occluded by the VR headset. Mixed Reality (MR) is a medium that alleviates some of this disconnect by sharing the virtual context of a VR user in a flat video format that can be consumed by an audience to get a feel for the user's experience.
Even though MR allows audiences to connect actions of the VR user with their virtual environment, empathizing with them is difficult because their face is hidden by the headset. We present a solution to address this problem by virtually removing the headset and revealing the face underneath it using a combination of 3D vision, machine learning and graphics techniques. We have integrated our headset removal approach with Mixed Reality, and demonstrate results on several VR games and experiences.
Eyemotion: Classifying facial expressions in VR using eye-tracking cameras
Nick Dufour
arXiv, https://arxiv.org/abs/1707.07204 (2017)
Abstract
One of the main challenges of social interaction in virtual reality settings is that head-mounted displays occlude a large portion of the face, blocking facial expressions and thereby restricting social engagement cues among users. Hence, auxiliary means of sensing and conveying these expressions are needed. We present an algorithm to automatically infer expressions by analyzing only a partially occluded face while the user is engaged in a virtual reality experience. Specifically, we show that images of the user's eyes captured from an IR gaze-tracking camera within a VR headset are sufficient to infer a select subset of facial expressions without the use of any fixed external camera. Using these inferences, we can generate dynamic avatars in real time which function as an expressive surrogate for the user. We propose a novel data collection pipeline as well as a novel approach for increasing CNN accuracy via personalization. Our results show a mean accuracy of 74% (F1 of 0.73) among 5 'emotive' expressions and a mean accuracy of 70% (F1 of 0.68) among 10 distinct facial action units, outperforming human raters.
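As an illustration of the kind of model such a system might use, here is a small PyTorch sketch: a compact CNN over grayscale eye-camera crops plus a personalization step that briefly fine-tunes the classifier head on a few labeled frames from the target user. The architecture, input size, class count, and fine-tuning recipe are all assumptions for this sketch and are not taken from the paper.

    import torch
    import torch.nn as nn

    class EyeExpressionNet(nn.Module):
        # Generic small CNN over 1 x 64 x 64 grayscale eye crops -> expression logits.
        def __init__(self, num_classes=5):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(64, num_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))

    def personalize(model, user_images, user_labels, lr=1e-4, steps=100):
        # Personalization as a brief fine-tune of only the classifier head on a
        # handful of labeled frames from the target user (an assumed recipe).
        opt = torch.optim.Adam(model.classifier.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(steps):
            opt.zero_grad()
            loss = loss_fn(model(user_images), user_labels)
            loss.backward()
            opt.step()
        return model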
Abstract
Personal photo albums are heavily biased towards faces of people, but most state-of-the-art algorithms for image denoising and noise estimation do not exploit facial information. We propose a novel technique for jointly estimating noise levels of all face images in a photo collection. Photos in a personal album are likely to contain several faces of the same people. While some of these photos would be clean and high quality, others may be corrupted by noise. Our key idea is to estimate noise levels by comparing multiple images of the same content that differ predominantly in their noise content. Specifically, we compare geometrically and photometrically aligned face images of the same person.
Our estimation algorithm is based on a probabilistic formulation that seeks to maximize the joint probability of estimated noise levels across all images. We propose an approximate solution that decomposes this joint maximization into a two-stage optimization. The first stage determines the relative noise between pairs of images by pooling estimates from corresponding patch pairs in a probabilistic fashion. The second stage then jointly optimizes for all absolute noise parameters by conditioning them upon relative noise levels, which allows for a pairwise factorization of the probability distribution. We evaluate our noise estimation method using quantitative experiments to measure accuracy on synthetic data. Additionally, we employ the estimated noise levels for automatic denoising using "BM3D", and evaluate the quality of denoising on real-world photos through a user study.
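A much-simplified stand-in for the two-stage idea: for independent noise on aligned images of the same content, Var(a_i - a_j) ≈ σ_i² + σ_j², so pairwise difference variances give noisy linear measurements of the per-image noise levels, which can then be solved jointly. The least-squares solve below replaces the paper's probabilistic formulation and is only meant to make the structure of the two stages concrete.

    import numpy as np

    def pairwise_variance_sum(img_i, img_j):
        # Stage 1 (simplified): for aligned patches of the same content with
        # independent noise, the shared signal cancels in the difference, so
        # Var(img_i - img_j) ~= sigma_i^2 + sigma_j^2.
        diff = np.asarray(img_i, dtype=float) - np.asarray(img_j, dtype=float)
        return diff.var()

    def joint_noise_levels(pairs, measurements, n_images):
        # Stage 2 (simplified): jointly solve x_i + x_j ~= m_ij over all pairs
        # in the least-squares sense, then return per-image noise std-devs.
        A = np.zeros((len(pairs), n_images))
        for row, (i, j) in enumerate(pairs):
            A[row, i] = 1.0
            A[row, j] = 1.0
        x, *_ = np.linalg.lstsq(A, np.asarray(measurements, dtype=float), rcond=None)
        return np.sqrt(np.clip(x, 0.0, None))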
Abstract
We propose a framework for automatic enhancement of group photographs by facial expression analysis. We are motivated by the observation that group photographs are seldom perfect. Subjects may have inadvertently closed their eyes, may be looking away, or may not be smiling at that moment. Given a set of photographs of the same group of people, our algorithm uses facial analysis to determine a goodness score for each face instance in those photos. This scoring function is based on classifiers for facial expressions such as smiles and eye-closure, trained over a large set of annotated photos. Given these scores, a best composite for the set is synthesized by (a) selecting the photo with the best overall score, and (b) replacing any low-scoring faces in that photo with high-scoring faces of the same person from other photos, using alignment and seamless composition.
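The selection logic can be sketched as follows: score every face instance, pick the photo with the best total score as the base, and mark any face in it that scores worse than the same person's best face elsewhere for replacement. The score weights and the dictionary-based photo representation are made up for this sketch; the actual scoring comes from trained smile and eye-closure classifiers, and the compositing itself (alignment and seamless blending) is not shown.

    def goodness(face):
        # Hypothetical combination of per-face classifier outputs in [0, 1].
        return 0.6 * face["smile_score"] + 0.4 * face["eyes_open_score"]

    def plan_composite(photos):
        # photos: list of {"faces": {person_id: face_dict}} for the same group.
        base = max(range(len(photos)),
                   key=lambda k: sum(goodness(f) for f in photos[k]["faces"].values()))
        swaps = {}
        for person, face in photos[base]["faces"].items():
            candidates = [(k, p["faces"][person])
                          for k, p in enumerate(photos) if person in p["faces"]]
            best_idx, best_face = max(candidates, key=lambda kf: goodness(kf[1]))
            if best_idx != base and goodness(best_face) > goodness(face):
                swaps[person] = best_idx   # take this person's face from a better photo
        return base, swaps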
Calibration-Free Rolling Shutter Removal
International Conference on Computational Photography [Best Paper], IEEE (2012)
Abstract
We present a novel algorithm for efficient removal of rolling shutter distortions in uncalibrated streaming videos. Our proposed method is calibration-free: it does not need any knowledge of the camera used, nor does it require calibration using specially recorded calibration sequences. Our algorithm can perform rolling shutter removal under varying focal lengths, as in videos from CMOS cameras equipped with an optical zoom. We evaluate our approach across a broad range of cameras and video sequences, demonstrating robustness, scalability, and repeatability. We also conducted a user study, which demonstrates preference for the output of our algorithm over other state-of-the-art methods. Our algorithm is computationally efficient, easy to parallelize, and robust to challenging artifacts introduced by various cameras with differing technologies.
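For intuition, the toy sketch below illustrates the distortion itself: each scanline of a rolling-shutter frame is read out at a slightly different time, so under camera motion each row sees a slightly different shift. The sketch undoes a purely horizontal, constant motion by shifting each row in proportion to its readout time; it is a conceptual illustration only, not the calibration-free algorithm from the paper.

    import numpy as np

    def unskew_rolling_shutter_toy(frame, horizontal_motion_px):
        # frame: H x W (x C) image; horizontal_motion_px: camera motion per frame
        # in pixels, assumed constant and purely horizontal for this toy example.
        h = frame.shape[0]
        out = np.empty_like(frame)
        for y in range(h):
            t = y / float(h - 1)                          # readout time of row y in [0, 1]
            shift = int(round(t * horizontal_motion_px))  # skew accumulated by row y
            out[y] = np.roll(frame[y], -shift, axis=0)    # wraps at the border for simplicity
        return out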
Shadow Removal for Aerial Imagery by Information Theoretic Intrinsic Image Analysis
Mei Han
Shengyang Dai
International Conference on Computational Photography, IEEE (2012)
Abstract
We present a novel technique for shadow removal based on an information theoretic approach to intrinsic image analysis. Our key observation is that any illumination change in the scene tends to increase the entropy of observed texture intensities. Similarly, the presence of texture in the scene increases the entropy of the illumination function. Consequently, we formulate the separation of an image into texture and illumination components as the minimization of the entropies of each component. We employ a non-parametric kernel-based quadratic entropy formulation and present an efficient multi-scale iterative optimization algorithm for minimizing the resulting energy functional. Our technique may be employed either fully automatically, using a proposed learning-based method for automatic initialization, or alternatively with a small amount of user interaction. As we demonstrate, our method is particularly suitable for aerial images, which consist of either distinctive texture patterns, e.g., building facades, or soft shadows with large diffuse regions, e.g., cloud shadows.
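The quantity being minimized can be made concrete with the standard kernel-based estimate of quadratic (Rényi order-2) entropy. Under a log-domain intrinsic-image model, log I = log T + log L, the sketch below scores a candidate illumination layer by the sum of the two components' entropies; the kernel width and function names are assumptions for illustration, and the multi-scale optimization itself is not shown.

    import numpy as np

    def quadratic_entropy(samples, sigma=0.05):
        # Kernel (Parzen) estimate of Renyi quadratic entropy:
        # H2 = -log( (1/N^2) * sum_ij G(x_i - x_j; 2*sigma^2) ).
        # O(N^2) in the number of samples, so subsample pixels in practice.
        x = np.asarray(samples, dtype=float).ravel()
        d = x[:, None] - x[None, :]
        var = 2.0 * sigma ** 2
        kernel = np.exp(-d ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
        return -np.log(kernel.mean())

    def separation_cost(log_image, log_illumination, sigma=0.05):
        # Intrinsic-image split in the log domain: log I = log T + log L.
        log_texture = log_image - log_illumination
        return quadratic_entropy(log_texture, sigma) + quadratic_entropy(log_illumination, sigma)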
Weakly Supervised Learning of Object Segmentations from Web-Scale Video
Glenn Hartmann
Judy Hoffman
David Tsai
Omid Madani
James Rehg
ECCV'12: Proceedings of the 12th European Conference on Computer Vision, Volume Part I, Springer-Verlag, Berlin, Heidelberg (2012), pp. 198-208
Abstract
We propose to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, our goal is to automatically generate spatiotemporal masks for each object, such as "dog", without employing any pre-trained object detectors. We formulate this problem as learning weakly supervised classifiers for a set of independent spatio-temporal segments. The object seeds obtained using segment-level classifiers are further refined using graphcuts to generate high-precision object masks. Our results, obtained by training on a dataset of 20,000 YouTube videos weakly tagged into 15 classes, demonstrate automatic extraction of pixel-level object masks. Evaluated against a ground-truthed subset of 50,000 frames with pixel-level annotations, we confirm that our proposed methods can learn good object masks just by watching YouTube.
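The segment-level learning step can be sketched with an off-the-shelf classifier: treat every segment from a video tagged with the target class as a (noisy) positive and segments from untagged videos as negatives, then use the classifier's scores as object seeds. The feature representation and the use of scikit-learn's logistic regression are assumptions for this sketch, and the graph-cut refinement stage is omitted.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_weak_segment_classifier(segment_features, segment_video_ids, video_tags, target_tag):
        # Weak supervision: label each spatio-temporal segment by whether its source
        # video carries the target tag (e.g., "dog"), accepting noisy positives.
        X = np.asarray(segment_features, dtype=float)
        y = np.array([1 if target_tag in video_tags[v] else 0 for v in segment_video_ids])
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, y)
        return clf  # clf.predict_proba(X)[:, 1] gives per-segment object seed scores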