Chris Bregler
Chris Bregler is a Director / Principal Scientist at Google DeepMind. He received an Academy Award in the Science and Technology category for his work in visual effects. His other awards include the IEEE Longuet-Higgins Prize for "Fundamental Contributions in Computer Vision that Have Withstood the Test of Time," the Olympus Prize, and grants from the National Science Foundation, Packard Foundation, Electronic Arts, Microsoft, U.S. Navy, U.S. Air Force, and other agencies. Formerly a professor at New York University and Stanford University, he was named Stanford Joyce Faculty Fellow, Terman Fellow, and Sloan Research Fellow. In addition to working for several companies including Hewlett-Packard, Interval, Disney Feature Animation, Lucasfilm's ILM, and the New York Times, he was the executive producer of squid-ball.com, for which he built the world's largest real-time motion capture volume. He received his M.S. and Ph.D. in Computer Science from U.C. Berkeley.
Full publications list at http://chris.bregler.com
Authored Publications
Alpha matting is widely used in video conferencing as well as in movies, television, and online video publishing sites such as YouTube. Deep learning approaches to the matte extraction problem are well suited to video conferencing due to the relatively consistent subject (front-facing humans); however, they are less appropriate for entertainment videos where varied subjects (spaceships, monsters, etc.) may appear only a few times. We introduce a one-shot matte extraction approach that targets these applications. Our approach is based on the deep image prior, which optimizes a deep neural network to map a fixed random input to a single output, thereby providing a somewhat deep and hierarchical encoding of the particular image. We make use of the representations in the penultimate layer to interpolate coarse and incomplete "trimap" constraints. The algorithm is both very simple and surprisingly effective, though (in common with classic methods that solve large sparse linear systems) it is too slow for real-time or interactive use.
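To make the idea concrete, here is a minimal sketch (in PyTorch, which the abstract does not specify) of the deep-image-prior setup: a small convolutional network is fit to reproduce the single input image from a fixed random tensor, and a tiny linear probe on its penultimate features is then trained on the known trimap pixels to fill in alpha elsewhere. The architecture, step counts, and the linear read-out are illustrative assumptions, not the paper's exact method.

```python
# Minimal sketch, assuming PyTorch and a toy architecture; the paper's exact
# network, losses, and read-out are not specified here.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_net(in_ch=32, hidden=64):
    # Small fully convolutional net; `head` is the output layer, so the
    # activations feeding it play the role of the penultimate representation.
    body = nn.Sequential(
        nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(),
        nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
    )
    head = nn.Conv2d(hidden, 3, 3, padding=1)
    return body, head

def one_shot_matte(image, trimap, steps=2000, lr=1e-3):
    # image: (1, 3, H, W) in [0, 1]; trimap: (H, W) with 1=fg, 0=bg, -1=unknown.
    _, _, H, W = image.shape
    z = torch.randn(1, 32, H, W)           # fixed random input (deep image prior)
    body, head = make_net()
    opt = torch.optim.Adam(list(body.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(steps):
        recon = torch.sigmoid(head(body(z)))
        loss = F.mse_loss(recon, image)    # fit this single image only
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Hypothetical read-out: a linear probe on penultimate features, trained on
    # the known trimap pixels and evaluated everywhere to interpolate alpha.
    with torch.no_grad():
        fmap = body(z)[0]                                  # (hidden, H, W)
        feats = fmap.permute(1, 2, 0).reshape(H * W, -1)   # (H*W, hidden)
    labels = trimap.reshape(-1)
    known = labels >= 0
    probe = nn.Linear(feats.shape[1], 1)
    probe_opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(500):
        pred = probe(feats[known]).squeeze(1)
        loss = F.binary_cross_entropy_with_logits(pred, labels[known].float())
        probe_opt.zero_grad()
        loss.backward()
        probe_opt.step()
    with torch.no_grad():
        alpha = torch.sigmoid(probe(feats)).reshape(H, W)
    return alpha
```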
Despite the recent attention to DeepFakes, one of the most prevalent ways to mislead audiences on social media is the use of unaltered images in a new but false context. To address these challenges and support fact-checkers, we propose a new method that automatically detects out-of-context image and text pairs. Our key insight is to leverage the grounding of images with text to distinguish out-of-context scenarios that cannot be disambiguated with language alone. We propose a self-supervised training strategy where we only need a set of captioned images. At train time, our method learns to selectively align individual objects in an image with textual claims, without explicit supervision. At test time, we check whether both captions correspond to the same object(s) in the image but are semantically different, which allows us to make fairly accurate out-of-context predictions. Our method achieves 85% out-of-context detection accuracy. To facilitate benchmarking of this task, we create a large-scale dataset of 200K images with 450K textual captions from a variety of news websites, blogs, and social media posts. The dataset and source code are publicly available at this https URL.
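The decision rule described above can be summarized in a short, purely illustrative sketch: ground each caption to its best-matching detected object, then flag the pair as out-of-context when both captions ground to the same object while their text embeddings disagree. The embedding inputs and thresholds below are hypothetical placeholders, not the trained model.

```python
# Illustrative sketch only: the test-time rule (same grounded object,
# different semantics => out-of-context). Region and caption embeddings are
# assumed to come from some upstream encoders; thresholds are placeholders.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def grounded_object(region_embs, caption_emb):
    # Pick the detected object region that best aligns with the caption.
    scores = [cosine(r, caption_emb) for r in region_embs]
    return int(np.argmax(scores)), max(scores)

def is_out_of_context(region_embs, cap1_emb, cap2_emb,
                      align_thresh=0.3, sim_thresh=0.5):
    obj1, s1 = grounded_object(region_embs, cap1_emb)
    obj2, s2 = grounded_object(region_embs, cap2_emb)
    same_object = (obj1 == obj2) and s1 > align_thresh and s2 > align_thresh
    semantically_different = cosine(cap1_emb, cap2_emb) < sim_thresh
    return same_object and semantically_different
```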
View details
LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization
Avisek Lahiri
Christian Frueh
John Lewis
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2021) (to appear)
In this paper, we present a video-based learning framework for animating personalized 3D talking faces from audio. We introduce two training-time data normalizations that significantly improve data sample efficiency. First, we isolate and represent faces in a normalized space that decouples 3D geometry, head pose, and texture. This decomposes the prediction problem into regressions over the 3D face shape and the corresponding 2D texture atlas. Second, we leverage facial symmetry and approximate albedo constancy of skin to isolate and remove spatio-temporal lighting variations. Together, these normalizations allow simple networks to generate high fidelity lip-sync videos under novel ambient illumination while training with just a single speaker-specific video. Further, to stabilize temporal dynamics, we introduce an auto-regressive approach that conditions the model on its previous visual state. Human ratings and objective metrics demonstrate that our method outperforms contemporary state-of-the-art audio-driven video reenactment benchmarks in terms of realism, lip-sync and visual quality scores. We illustrate several applications enabled by our framework.
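As a rough illustration of the lighting normalization only (not the paper's exact pipeline), one can exploit left-right facial symmetry and approximate albedo constancy on the texture atlas: symmetrize each frame, take a temporal median as an approximate albedo reference, and divide out the low-frequency ratio to that reference as a lighting field. The blur scale and the use of a median are assumptions for illustration.

```python
# Rough sketch of symmetry-based lighting normalization on a texture atlas,
# under the stated albedo-constancy assumption; parameters are illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter

def normalize_lighting(atlas_frames, sigma=15.0, eps=1e-4):
    # atlas_frames: (T, H, W, 3) texture atlases in [0, 1], left-right aligned.
    frames = np.asarray(atlas_frames, dtype=np.float32)
    # Symmetrize each frame: average with its horizontal mirror.
    sym = 0.5 * (frames + frames[:, :, ::-1, :])
    # Temporal median as an approximate lighting-free (albedo) reference.
    albedo = np.median(sym, axis=0)
    normalized = []
    for f in sym:
        # Low-frequency ratio to the reference approximates a smooth lighting field.
        ratio = gaussian_filter(f, sigma=(sigma, sigma, 0)) / \
                (gaussian_filter(albedo, sigma=(sigma, sigma, 0)) + eps)
        normalized.append(np.clip(f / (ratio + eps), 0.0, 1.0))
    return np.stack(normalized), albedo
```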
With a proliferation of generic domain-adaptation approaches, we report a simple yet effective technique for learning difficult per-pixel 2.5D and 3D regression representations of articulated people. We obtain strong sim-to-real domain generalization for the 2.5D DensePose estimation task and the 3D human surface normal estimation task. On the multi-person DensePose MSCOCO benchmark, our approach outperforms the state-of-the-art methods which are trained on real images that are densely labelled. This is an important result, since obtaining the human manifold's intrinsic uv coordinates on real images is time consuming and prone to labeling noise. Additionally, we present our model's 3D surface normal predictions on the MSCOCO dataset, which lacks any real 3D surface normal labels. The key to our approach is to mitigate the "Inter-domain Covariate Shift" with a carefully selected training batch drawn from a mixture of domain samples, a deep batch-normalized residual network, and a modified multi-task learning objective. Our approach is complementary to existing domain-adaptation techniques and can be applied to other dense per-pixel pose estimation problems.
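The batch-construction idea can be sketched in a few lines: draw every training batch from a fixed mixture of synthetic and real samples so that batch-normalization statistics are always computed over both domains at once. The mixing ratio and the dataset interface below are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal sketch of mixed-domain batching to mitigate inter-domain covariate
# shift; the 50/50 split and list-based datasets are illustrative choices.
import random

def mixed_domain_batches(synthetic, real, batch_size=32, synth_frac=0.5):
    # synthetic, real: lists (or any indexable datasets) of training samples.
    n_synth = int(batch_size * synth_frac)
    n_real = batch_size - n_synth
    while True:
        batch = random.sample(synthetic, n_synth) + random.sample(real, n_real)
        random.shuffle(batch)  # batch-norm then sees both domains in one batch
        yield batch
```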
In this work we propose a model that enables controlled manipulation of visual attributes of real "target" images (e.g., lighting, expression, or pose) using only implicit supervision with synthetic "source" exemplars. Specifically, our model learns a shared low-dimensional representation of input images from both domains in which a property of interest is isolated from other content features of the input. By using triplets of synthetic images that demonstrate modification of the visual attribute we would like to control (for example, mouth opening), we are able to disentangle image representations with respect to this attribute without using explicit attribute labels in either domain. Since our technique relies on triplets instead of explicit labels, it can be applied to shape, texture, lighting, or other properties that are difficult to measure or represent as explicit conditioners. We quantitatively analyze the degree to which trained models learn to isolate the property of interest from other content features with a proof-of-concept digit dataset, and demonstrate results in a far more difficult setting: learning to manipulate real faces using a dataset of synthetic 3D faces. We also explore the limitations of our model with respect to differences in the distributions of properties observed in the two domains.
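A loose sketch of the triplet objective, under the assumption that a triplet supplies two synthetic images differing only in the controlled attribute: their content codes are encouraged to match, and swapping attribute codes should reconstruct the counterpart image. The encoder split, swap reconstruction, and loss weights are illustrative, not the paper's exact formulation.

```python
# Illustrative triplet-based disentanglement loss; enc_attr/enc_rest/dec are
# assumed user-provided networks, and the weighting is a placeholder.
import torch
import torch.nn.functional as F

def triplet_disentangle_loss(enc_attr, enc_rest, dec, a, b, c,
                             w_swap=1.0, w_rest=1.0):
    # enc_attr / enc_rest: map images to attribute / content codes.
    # dec: decoder taking (attribute code, content code) back to an image.
    # a, b: synthetic images differing only in the attribute; c: extra sample.
    attr_a, rest_a = enc_attr(a), enc_rest(a)
    attr_b, rest_b = enc_attr(b), enc_rest(b)

    # a and b share everything except the attribute, so content codes should agree ...
    rest_match = F.mse_loss(rest_a, rest_b)

    # ... and swapping attribute codes between them should reconstruct the other image.
    swap_recon = F.mse_loss(dec(attr_b, rest_a), b) + F.mse_loss(dec(attr_a, rest_b), a)

    # Plain reconstruction keeps the codes informative.
    recon = F.mse_loss(dec(enc_attr(c), enc_rest(c)), c)

    return recon + w_swap * swap_recon + w_rest * rest_match
```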
Towards Accurate Multi-person Pose Estimation in the Wild
George Papandreou
Tyler Zhu
Nori Kanazawa
Alexander Toshev
CVPR (2017)
We propose a method for multi-person detection and 2-D keypoint localization (human pose estimation) that achieves state-of-the-art results on the challenging COCO keypoints task. It is a simple, yet powerful, top-down approach consisting of two stages.
In the first stage, we predict the location and scale of boxes that are likely to contain people; for this we use the Faster R-CNN detector with an Inception-ResNet architecture. In the second stage, we estimate the keypoints of the person potentially contained in each proposed bounding box. For each keypoint type we predict dense heatmaps and offsets using a fully convolutional ResNet. To combine these outputs, we introduce a novel aggregation procedure to obtain highly localized keypoint predictions. We also use a novel form of keypoint-based non-maximum suppression (NMS), instead of the cruder box-level NMS, and a novel form of keypoint-based confidence score estimation, instead of box-level scoring.
Our final system achieves an average precision of 0.636 on the COCO test-dev set and 0.628 on the test-standard set, outperforming the CMU-Pose winner of the 2016 COCO keypoints challenge. Further, by using additional labeled data we obtain an even higher average precision of 0.668 on the test-dev set and 0.658 on the test-standard set, thus achieving a roughly 10% improvement over the previous best-performing method on the same challenge.
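The second-stage decoding can be illustrated with a small sketch: for each keypoint type, pick a high-probability heatmap cell, refine it with the predicted 2-D offset, and score the instance by its keypoint confidences rather than the detector's box score. The simple argmax used here stands in for the paper's aggregation procedure and is only an assumption.

```python
# Sketch of heatmap + offset decoding and keypoint-based instance scoring;
# array layouts, the stride, and the argmax decoding are illustrative.
import numpy as np

def decode_keypoint(heatmap, offsets, stride=8):
    # heatmap: (H, W) probabilities for one keypoint type;
    # offsets: (H, W, 2) as (dy, dx) in pixels from each cell to the keypoint.
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    dy, dx = offsets[y, x]
    keypoint_xy = (x * stride + dx, y * stride + dy)  # sub-pixel location
    score = float(heatmap[y, x])
    return keypoint_xy, score

def pose_score(keypoint_scores):
    # Instance confidence from its keypoint scores rather than the box score
    # (keypoint-based confidence estimation, as in the abstract).
    return float(np.mean(keypoint_scores))
```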