Vivek Kwatra

Authored Publications
    Synthesis-Assisted Video Prototyping From a Document
    Brian R. Colonna
    Christian Frueh
    UIST 2022: ACM Symposium on User Interface Software and Technology (2022)
    Video productions commonly start with a script, especially for talking head videos that feature a speaker narrating to the camera. When the source materials come from a written document, such as a web tutorial, it takes iterations to refine content from a text article to a spoken dialogue, while considering visual compositions in each scene. We propose Doc2Video, a video prototyping approach that converts a document to interactive scripting with a preview of synthetic talking head videos. Our pipeline decomposes a source document into a series of scenes, each automatically creating a synthesized video of a virtual instructor. Designed for a specific domain, programming cookbooks, we apply visual elements from the source document, such as a keyword, a code snippet or a screenshot, in suitable layouts. Users edit narration sentences, break or combine sections, and modify visuals to prototype a video in our Editing UI. We evaluated our pipeline with public programming cookbooks. Feedback from professional creators shows that our method provided a reasonable starting point to engage them in interactive scripting for a narrated instructional video.
    LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization
    Avisek Lahiri
    Christian Frueh
    John Lewis
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2021) (to appear)
    In this paper, we present a video-based learning framework for animating personalized 3D talking faces from audio. We introduce two training-time data normalizations that significantly improve data sample efficiency. First, we isolate and represent faces in a normalized space that decouples 3D geometry, head pose, and texture. This decomposes the prediction problem into regressions over the 3D face shape and the corresponding 2D texture atlas. Second, we leverage facial symmetry and approximate albedo constancy of skin to isolate and remove spatio-temporal lighting variations. Together, these normalizations allow simple networks to generate high fidelity lip-sync videos under novel ambient illumination while training with just a single speaker-specific video. Further, to stabilize temporal dynamics, we introduce an auto-regressive approach that conditions the model on its previous visual state. Human ratings and objective metrics demonstrate that our method outperforms contemporary state-of-the-art audio-driven video reenactment benchmarks in terms of realism, lip-sync and visual quality scores. We illustrate several applications enabled by our framework.
    The use of 360 degree cameras, enabling one to record and share a full-spherical 360° × 180° view without any cropping in the viewing angle, is on the rise. Shake in such videos is problematic, especially when they are viewed with VR headsets, where it causes cyber sickness for the viewer. The current state of the art video stabilization algorithm [kopf16], designed specifically for 360 degree videos, considers the special geometrical constraints in such videos. However, the specific steps in the algorithm can abruptly change the viewing direction in a video, leading to an unnatural experience for the viewer. In this paper, we propose to fix the anomaly by the use of L1 smoothness constraints on the camera path, as suggested by Grundmann et al. [grundmann11]. The modified algorithm is generic and our experiments indicate that the proposed algorithm not only gives a more natural and smoother stabilization for 360 degree videos but can be used for stabilizing normal field of view videos as well.
    One of the main challenges of social interaction in virtual reality settings is that head-mounted displays occlude a large portion of the face, blocking facial expressions and thereby restricting social engagement cues among users. Hence, auxiliary means of sensing and conveying these expressions are needed. We present an algorithm to automatically infer expressions by analyzing only a partially occluded face while the user is engaged in a virtual reality experience. Specifically, we show that images of the user's eyes captured from an IR gaze-tracking camera within a VR headset are sufficient to infer a select subset of facial expressions without the use of any fixed external camera. Using these inferences, we can generate dynamic avatars in real-time which function as an expressive surrogate for the user. We propose a novel data collection pipeline as well as a novel approach for increasing CNN accuracy via personalization. Our results show a mean accuracy of 74% (F1 of 0.73) among 5 'emotive' expressions and a mean accuracy of 70% (F1 of 0.68) among 10 distinct facial action units, outperforming human raters.
    Headset Removal for Virtual and Mixed Reality
    Christian Frueh
    SIGGRAPH Talks 2017, ACM SIGGRAPH (to appear)
    Virtual Reality (VR) has advanced significantly in recent years and allows users to explore novel environments (both real and imaginary), play games, and engage with media in a way that is unprecedentedly immersive. However, compared to physical reality, sharing these experiences is difficult because the user's virtual environment is not easily observable from the outside and the user's face is partly occluded by the VR headset. Mixed Reality (MR) is a medium that alleviates some of this disconnect by sharing the virtual context of a VR user in a flat video format that can be consumed by an audience to get a feel for the user's experience. Even though MR allows audiences to connect actions of the VR user with their virtual environment, empathizing with them is difficult because their face is hidden by the headset. We present a solution to address this problem by virtually removing the headset and revealing the face underneath it using a combination of 3D vision, machine learning and graphics techniques. We have integrated our headset removal approach with Mixed Reality, and demonstrate results on several VR games and experiences.
    Personal photo albums are heavily biased towards faces of people, but most state-of-the-art algorithms for image denoising and noise estimation do not exploit facial information. We propose a novel technique for jointly estimating noise levels of all face images in a photo collection. Photos in a personal album are likely to contain several faces of the same people. While some of these photos would be clean and high quality, others may be corrupted by noise. Our key idea is to estimate noise levels by comparing multiple images of the same content that differ predominantly in their noise content. Specifically, we compare geometrically and photometrically aligned face images of the same person. Our estimation algorithm is based on a probabilistic formulation that seeks to maximize the joint probability of estimated noise levels across all images. We propose an approximate solution that decomposes this joint maximization into a two-stage optimization. The first stage determines the relative noise between pairs of images by pooling estimates from corresponding patch pairs in a probabilistic fashion. The second stage then jointly optimizes for all absolute noise parameters by conditioning them upon relative noise levels, which allows for a pairwise factorization of the probability distribution. We evaluate our noise estimation method using quantitative experiments to measure accuracy on synthetic data. Additionally, we employ the estimated noise levels for automatic denoising using "BM3D", and evaluate the quality of denoising on real-world photos through a user study.
    Shadow Removal for Aerial Imagery by Information Theoretic Intrinsic Image Analysis
    Mei Han
    Shengyang Dai
    International Conference on Computational Photography, IEEE (2012)
    We present a novel technique for shadow removal based on an information theoretic approach to intrinsic image analysis. Our key observation is that any illumination change in the scene tends to increase the entropy of observed texture intensities. Similarly, the presence of texture in the scene increases the entropy of the illumination function. Consequently, we formulate the separation of an image into texture and illumination components as minimization of entropies of each component. We employ a non-parametric kernel-based quadratic entropy formulation, and present an efficient multi-scale iterative optimization algorithm for minimization of the resulting energy functional. Our technique may be employed either fully automatically, using a proposed learning based method for automatic initialization, or alternatively with a small amount of user interaction. As we demonstrate, our method is particularly suitable for aerial images, which consist of either distinctive texture patterns, e.g. building facades, or soft shadows with large diffuse regions, e.g. cloud shadows.
    Calibration-Free Rolling Shutter Removal
    Daniel Castro
    International Conference on Computational Photography [Best Paper], IEEE (2012)
    We present a novel algorithm for efficient removal of rolling shutter distortions in uncalibrated streaming videos. Our proposed method is calibration free as it does not need any knowledge of the camera used, nor does it require calibration using specially recorded calibration sequences. Our algorithm can perform rolling shutter removal under varying focal lengths, as in videos from CMOS cameras equipped with an optical zoom. We evaluate our approach across a broad range of cameras and video sequences demonstrating robustness, scalability, and repeatability. We also conducted a user study, which demonstrates preference for the output of our algorithm over other state-of-the-art methods. Our algorithm is computationally efficient, easy to parallelize, and robust to challenging artifacts introduced by various cameras with differing technologies.
    All Smiles: Automatic Photo Enhancement by Facial Expression Analysis
    Rajvi Shah
    Conference for Visual Media Production (CVMP 2012) [Best Paper]
    We propose a framework for automatic enhancement of group photographs by facial expression analysis. We are motivated by the observation that group photographs are seldom perfect. Subjects may have inadvertently closed their eyes, may be looking away, or may not be smiling at that moment. Given a set of photographs of the same group of people, our algorithm uses facial analysis to determine a goodness score for each face instance in those photos. This scoring function is based on classifiers for facial expressions such as smiles and eye-closure, trained over a large set of annotated photos. Given these scores, a best composite for the set is synthesized by (a) selecting the photo with the best overall score, and (b) replacing any low-scoring faces in that photo with high-scoring faces of the same person from other photos, using alignment and seamless composition.
    Weakly Supervised Learning of Object Segmentations from Web-Scale Video
    Glenn Hartmann
    Judy Hoffman
    David Tsai
    Omid Madani
    James Rehg
    ECCV'12 Proceedings of the 12th international conference on Computer Vision - Volume Part I, Springer-Verlag, Berlin, Heidelberg (2012), pp. 198-208
    We propose to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, our goal is to automatically generate spatiotemporal masks for each object, such as "dog", without employing any pre-trained object detectors. We formulate this problem as learning weakly supervised classifiers for a set of independent spatio-temporal segments. The object seeds obtained using segment-level classifiers are further refined using graphcuts to generate high-precision object masks. Our results, obtained by training on a dataset of 20,000 YouTube videos weakly tagged into 15 classes, demonstrate automatic extraction of pixel-level object masks. Evaluated against a ground-truthed subset of 50,000 frames with pixel-level annotations, we confirm that our proposed methods can learn good object masks just by watching YouTube.
    We present a novel algorithm for automatically applying constrainable, L1-optimal camera paths to generate stabilized videos by removing undesired motions. Our goal is to compute camera paths that are composed of constant, linear and parabolic segments mimicking the camera motions employed by professional cinematographers. To this end, our algorithm is based on a linear programming framework to minimize the first, second, and third derivatives of the resulting camera path. Our method allows for video stabilization beyond the conventional filtering of camera paths that only suppresses high frequency jitter. We incorporate additional constraints on the path of the camera directly in our algorithm, allowing for stabilized and retargeted videos. Our approach accomplishes this without the need of user interaction or costly 3D reconstruction of the scene, and works as a post-process for videos from any camera or from an online source.
    Example-based Image Compression
    Jing-Yu Cui
    Saurabh Mathur
    Michele Covell
    Mei Han
    International Conference on Image Processing (ICIP 2010)
    The current standard image-compression approaches rely on fairly simple predictions, using either block- or wavelet-based methods. While many more sophisticated texture-modeling approaches have been proposed, most do not provide a significant improvement in compression rate over the current standards at a workable encoding complexity level. We re-examine this area, using example-based texture prediction. We find that we can provide consistent and significant improvements over JPEG, reducing the bit rate by more than 20% for many PSNR levels. These improvements require consideration of the differences between residual energy and prediction/residual compressibility when selecting a texture prediction, as well as careful control of the computational complexity in encoding.
    This paper presents algorithms for efficiently computing the covariance matrix for features that form sub-windows in a large multi-dimensional image. For example, several image processing applications, e.g. texture analysis/synthesis, image retrieval, and compression, operate upon patches within an image. These patches are usually projected onto a low-dimensional feature space using dimensionality reduction techniques such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), which in turn require computation of the covariance matrix from a set of features. Covariance computation is usually the bottleneck during PCA or LDA (O(nd^2) where n is the number of pixels in the image and d is the dimensionality of the vector). Our approach reduces the complexity of covariance computation by exploiting the redundancy between feature vectors corresponding to overlapping patches. Specifically, we show that the covariance between two feature components can be reduced to a function of the relative displacement between those components in patch space. One can then employ a lookup table to store covariance values by relative displacement. By operating in the frequency domain, this lookup table can be computed in O(n log n) time. We allow the patches to sub-sample the image, which is useful for hierarchical processing and also enables working with filtered responses over these patches, such as local gist features. We also propose a method for fast projection of sub-window patches onto the low-dimensional space.
    We introduce a new algorithm for video retargeting that uses discontinuous seam-carving in both space and time for resizing videos. We propose a novel appearance-based temporal coherence formulation that allows for frame-by-frame processing and results in temporally discontinuous seams, as opposed to geometrically smooth and continuous seams. This formulation optimizes the difference in appearance of the resultant retargeted frame to the optimal temporally coherent one, and allows for carving around fast moving salient regions. Additionally, we generalize the idea of appearance-based coherence to the spatial domain by introducing piece-wise spatial seams. Our spatial coherence measure minimizes the change in gradients during retargeting, which preserves spatial detail better than minimization of color difference alone. We also show that retargeting based on per-frame saliency (gradient-based or feature-based) does not always lead to desirable results and propose a novel automatically computed measure of spatio-temporal saliency. As needed, the user can also augment the saliency by interactive region-brushing. Our retargeting algorithm processes the video sequentially, which allows us to deal with streaming videos. We demonstrate results over a wide range of video examples and evaluate the effectiveness of each component of our algorithm.
    We present an efficient and scalable technique for spatio-temporal segmentation of long video sequences using a hierarchical graph-based algorithm. We begin by over-segmenting a volumetric video graph into space-time regions grouped by appearance. We then construct a "region graph" over the obtained segmentation and iteratively repeat this process over multiple levels to create a tree of spatio-temporal segmentations. This hierarchical approach generates high quality segmentations which are temporally coherent with stable region boundaries. Additionally, the resulting segmentation hierarchy allows subsequent applications to choose from varying levels of granularity. We further improve segmentation quality by using dense optical flow when constructing the initial graph. We also propose two novel approaches to improve the scalability of our technique: (a) a parallel out-of-core algorithm that can process volumes much larger than an in-core algorithm, and (b) a clip-based processing algorithm that divides the video into overlapping clips in time, and segments them successively while enforcing consistency. We can segment video shots as long as 40 seconds without compromising quality, and even support a streaming mode for arbitrarily long videos, albeit without the ability to process them hierarchically.
    State of the Art in Example-based Texture Synthesis
    Li-Yi Wei
    Sylvain Lefebvre
    Greg Turk
    Eurographics 2009, State of the Art Report, EG-STAR, Eurographics Association
    Fluid in Video: Augmenting Real Video with Simulated Fluids
    Philippos Mordohai
    Rahul Narain
    Sashi Kumar Penta
    Mark Carlson
    Marc Pollefeys
    Ming C. Lin
    Comput. Graph. Forum (Proc. Eurographics), vol. 27 (2008), pp. 487-496
    Feature-Guided Dynamic Texture Synthesis on Continuous Flows
    Rahul Narain
    Huai-Ping Lee
    Theodore Kim
    Mark Carlson
    Ming Lin
    EGSR '07 (2007)
    Texturing Fluids
    David Adalsteinsson
    Theodore Kim
    Nipun Kwatra
    Mark Carlson
    Ming Lin
    IEEE Trans. Visualization and Computer Graphics, vol. 13 (2007), pp. 939-952
    Semantic Photo Synthesis
    Matthew Johnson
    Gabriel J. Brostow
    Jamie Shotton
    Ognjen Arandjelović
    Roberto Cipolla
    Computer Graphics Forum (Proc. Eurographics), vol. 25 (2006), pp. 407-413
    Texture Optimization for Example-based Synthesis
    Aaron Bobick
    Nipun Kwatra
    ACM Transactions on Graphics, SIGGRAPH 2005 (2005)
    Mixture Trees for Modeling and Fast Conditional Sampling with Applications in Vision and Graphics
    Frank Dellaert
    Sang Min Oh
    CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, IEEE Computer Society, Washington, DC, USA, pp. 619-624
    Novel Skeletal Representation For Articulated Creatures
    Gabriel J. Brostow
    Drew Steedly
    ECCV04 (2004), Vol III: 66-78
    Graphcut Textures: Image and Video Synthesis Using Graph Cuts
    Arno Schödl
    Greg Turk
    Aaron Bobick
    ACM Transactions on Graphics, SIGGRAPH 2003, vol. 22 (2003), pp. 277-286
    Space-Time Surface Simplification and Edgebreaker Compression for 2D Cel Animations
    Jarek Rossignac
    International Journal on Shape Modeling, vol. 8 (2002)
    Temporal Integration of Multiple Silhouette-based Body-part Hypotheses
    Aaron F. Bobick
    Amos Y. Johnson
    IEEE Computer Vision and Pattern Recognition (2001)