William T. Freeman
Bill Freeman is a Senior Research Scientist at Google, managing a team within Machine Perception doing research in vision and graphics. He is also a faculty member at MIT, in the Electrical Engineering and Computer Science Department, and a member of CSAIL, the Computer Science and Artificial Intelligence Laboratory there. He received outstanding paper awards at computer vision or machine learning conferences in 1997, 2006, 2009, and 2012, and test-of-time awards for papers from 1990 and 1995.
Authored Publications
MetaCLUE: Towards Comprehensive Visual Metaphors Research
Brendan Driscoll
Zhiwei Jia
Garima Pruthi
Leonidas Guibas
Varun Jampani
CVPR (2023)
Creativity is an indispensable part of human cognition and also an inherent part of how we make sense of the world. Metaphorical abstraction is fundamental in communicating creative ideas through nuanced relationships between abstract concepts such as feelings. While computer vision benchmarks and approaches predominantly focus on understanding and generating literal interpretations of images, metaphorical comprehension of images remains relatively unexplored. Towards this goal, we introduce MetaCLUE, a set of vision tasks on visual metaphor. Since no existing datasets facilitate the evaluation of these tasks, we also collect high-quality, rich metaphor annotations (abstract objects, concepts, and relationships, along with their corresponding object boxes). We perform a comprehensive analysis of state-of-the-art vision and language models based on our annotations, highlighting the strengths and weaknesses of current approaches in visual metaphor Classification, Localization, Understanding (retrieval, question answering, captioning), and gEneration (text-to-image synthesis) tasks. We hope this work provides a concrete step towards developing AI systems with human-like creative capabilities.
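As a rough illustration of the kind of annotation record described above, the following Python sketch shows one possible schema; the field names and example values are assumptions for illustration, not the released dataset's actual format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical sketch of a MetaCLUE-style annotation record; field names are
# assumptions for illustration, not the dataset's actual schema.
@dataclass
class MetaphorAnnotation:
    image_id: str
    primary_concept: str          # abstract concept being conveyed, e.g. "fragility"
    secondary_concept: str        # concrete concept it is mapped onto, e.g. "glass"
    relationship: str             # textual relation linking the two concepts
    object_boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)  # (x, y, w, h)

ann = MetaphorAnnotation(
    image_id="img_0001",
    primary_concept="fragility",
    secondary_concept="glass",
    relationship="the product is as fragile as glass",
    object_boxes=[(120, 40, 200, 180)],
)
print(ann.primary_concept, ann.object_boxes)
```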
MaskGIT: Masked Generative Image Transformer
CVPR (2022)
Generative transformers have experienced rapid popularity growth in the computer vision community for synthesizing high-fidelity and high-resolution images. The best generative transformer models so far, however, still treat an image naively as a sequence of tokens and decode an image sequentially following the raster-scan ordering (i.e., line by line). We find this strategy neither optimal nor efficient. This paper proposes a novel image synthesis paradigm using a bidirectional transformer decoder, which we term MaskGIT. During training, MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions. At inference time, the model begins by generating all tokens of an image simultaneously, and then refines the image iteratively, conditioned on the previous generation. Our experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset and accelerates autoregressive decoding by up to 64x. In addition, we show that MaskGIT can be easily extended to various image editing tasks, such as inpainting, extrapolation, and image manipulation.
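To make the decoding procedure concrete, here is a minimal Python sketch of iterative parallel decoding with confidence-based re-masking, using a dummy stand-in for the trained bidirectional transformer; the cosine schedule and other details are assumptions rather than the exact published recipe.

```python
import numpy as np

MASK_ID = -1  # sentinel value for a masked token

def dummy_transformer(tokens, vocab_size=1024):
    """Stand-in for the trained bidirectional transformer: returns per-position
    logits over the visual-token codebook."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(tokens), vocab_size))

def maskgit_decode(num_tokens=256, steps=8, vocab_size=1024):
    tokens = np.full(num_tokens, MASK_ID)
    for t in range(steps):
        logits = dummy_transformer(tokens, vocab_size)
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        pred = probs.argmax(axis=1)            # predict every position in parallel
        conf = probs.max(axis=1)
        conf[tokens != MASK_ID] = np.inf       # tokens decoded in earlier steps stay fixed
        # number of tokens left masked after this step (cosine schedule: an assumption)
        n_mask = int(num_tokens * np.cos(np.pi / 2 * (t + 1) / steps))
        remask = np.argsort(conf)[:n_mask]     # re-mask the least confident predictions
        tokens = np.where(tokens == MASK_ID, pred, tokens)
        tokens[remask] = MASK_ID
    return tokens

print(maskgit_decode()[:16])                   # every position is decoded after the last step
```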
Neural Light Transport for Relighting and View Synthesis
Xiuming Zhang
Yun-Ta Tsai
Tiancheng Sun
Tianfan Xue
Philip Davidson
Christoph Rhemann
Paul Debevec
Ravi Ramamoorthi
ACM Transactions on Graphics, 40 (2021)
The light transport (LT) of a scene describes how it appears under different lighting and viewing directions, and complete knowledge of a scene's LT enables the synthesis of novel views under arbitrary lighting. In this paper, we focus on image-based LT acquisition, primarily for human bodies within a light stage setup. We propose a semi-parametric approach to learn a neural representation of LT that is embedded in the space of a texture atlas of known geometric properties, and model all non-diffuse and global LT as residuals added to a physically-accurate diffuse base rendering. In particular, we show how to fuse previously seen observations of illuminants and views to synthesize a new image of the same scene under a desired lighting condition from a chosen viewpoint. This strategy allows the network to learn complex material effects (such as subsurface scattering) and global illumination, while guaranteeing the physical correctness of the diffuse LT (such as hard shadows). With this learned LT, one can relight the scene photorealistically with a directional light or an HDRI map, synthesize novel views with view-dependent effects, or do both simultaneously, all in a unified framework using a set of sparse, previously seen observations. Qualitative and quantitative experiments demonstrate that our neural LT (NLT) outperforms state-of-the-art solutions for relighting and view synthesis, without the separate treatment of each problem that prior work requires.
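A minimal sketch of the residual decomposition described above, with a toy stand-in for the neural component; the function names and the toy "network" are illustrative assumptions.

```python
import numpy as np

def render_with_residual(diffuse_base, residual_net, atlas_features):
    """Compose a physically-accurate diffuse base with a learned non-diffuse/global
    residual, as in the decomposition described above. `residual_net` stands in for
    the neural component; its form here is an assumption."""
    residual = residual_net(atlas_features)             # specular, subsurface, indirect, ...
    return np.clip(diffuse_base + residual, 0.0, None)  # keep radiance non-negative

# Toy usage: a flat "network" that predicts a small constant highlight.
H, W = 4, 4
diffuse = np.full((H, W, 3), 0.2)
features = np.zeros((H, W, 8))
toy_net = lambda feats: np.full(feats.shape[:2] + (3,), 0.05)
print(render_with_residual(diffuse, toy_net, features)[0, 0])
```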
SLIDE: Single Image 3D Photography with Soft Layering and Depth-aware Inpainting
Varun Jampani*
Huiwen Chang*
Kyle Gregory Sargent
Abhishek Kar
Mike Krainin
Dominik Philemon Kaeser
Ce Liu
ICCV (2021)
Single image 3D photography enables viewers to view a still image from novel viewpoints. Recent approaches for single-image view synthesis combine a monocular depth network with inpainting networks, resulting in compelling novel view synthesis results. A drawback of these approaches is their use of hard layering, which makes them unsuitable for modeling intricate appearance effects such as matting. We present SLIDE, a modular and unified system for single image 3D photography that uses a simple yet effective soft layering strategy to model such appearance effects. In addition, we propose a novel depth-aware training scheme for the inpainting network suited to the 3D photography task. Extensive experimental analysis on three different view synthesis datasets, in combination with user studies on in-the-wild image collections, demonstrates the superior performance of our technique in comparison to existing strong baselines.
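The soft-layering idea can be sketched as soft alpha compositing driven by depth, as below; the sigmoid softness and the depth threshold are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def soft_alpha_from_depth(depth, threshold, softness=0.05):
    """Soft (rather than hard/binary) foreground assignment from depth.
    The sigmoid softness constant is an assumption for illustration."""
    return 1.0 / (1.0 + np.exp(-(threshold - depth) / softness))

def composite(fg_rgb, bg_rgb, alpha):
    """Standard soft alpha compositing: soft layers let thin structures and
    matting-like effects blend instead of being cut out by a hard mask."""
    return alpha[..., None] * fg_rgb + (1.0 - alpha[..., None]) * bg_rgb

H, W = 4, 4
depth = np.random.default_rng(0).uniform(0.0, 1.0, (H, W))
fg = np.ones((H, W, 3)) * [1.0, 0.0, 0.0]   # red foreground layer
bg = np.ones((H, W, 3)) * [0.0, 0.0, 1.0]   # blue (inpainted) background layer
out = composite(fg, bg, soft_alpha_from_depth(depth, threshold=0.5))
print(out.shape, out.min(), out.max())
```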
AutoFlow: Learning a Better Training Set for Optical Flow
Daniel Vlasic
Charles Herrmann
Varun Jampani
Michael Krainin
Huiwen Chang
Ramin Zabih
Ce Liu
(2021)
Synthetic datasets play a critical role in pre-training CNN models for optical flow, but they are painstaking to generate and hard to adapt to new applications. To automate the process, we present AutoFlow, a simple and effective method to render training data for optical flow that optimizes the performance of a model on a target dataset. AutoFlow takes a layered approach to render synthetic data, where the motion, shape, and appearance of each layer are controlled by learnable hyperparameters. Experimental results show that AutoFlow achieves state-of-the-art accuracy in pre-training both PWC-Net and RAFT. Our code and data are available at https://autoflow-google.github.io.
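A schematic of the outer "learn the training set" loop, with stand-ins for rendering, training, and evaluation; the random search shown here is a simplification of the paper's hyperparameter optimization, and the hyperparameter names are assumptions.

```python
import random

def render_dataset(hyperparams):
    """Stand-in: render layered synthetic image pairs with ground-truth flow,
    where the motion, shape, and appearance of each layer follow `hyperparams`."""
    ...

def train_flow_model(dataset):
    """Stand-in: pre-train an optical-flow network (e.g. PWC-Net or RAFT)."""
    ...

def evaluate_on_target(model):
    """Stand-in: average end-point error on the target dataset (lower is better)."""
    return random.random()

def search_hyperparams(num_trials=20):
    best, best_err = None, float("inf")
    for _ in range(num_trials):
        hp = {
            "num_layers": random.randint(2, 5),       # illustrative hyperparameters,
            "max_motion": random.uniform(5.0, 60.0),  # not AutoFlow's actual set
            "texture_scale": random.uniform(0.5, 2.0),
        }
        model = train_flow_model(render_dataset(hp))
        err = evaluate_on_target(model)
        if err < best_err:
            best, best_err = hp, err
    return best, best_err

print(search_hyperparams(num_trials=5))
```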
THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers
Mihai Zanfir
Andrei Zanfir
Proceedings of the IEEE/CVF International Conference on Computer Vision (2021)
We present THUNDR, a transformer-based deep neural network methodology to reconstruct the 3D pose and shape of people from monocular RGB images. Key to our methodology is an intermediate 3D marker representation, through which we aim to combine the predictive power of model-free output architectures with the regularizing, anthropometry-preserving properties of statistical human surface models such as GHUM, a recently introduced, expressive full-body statistical 3D human model trained end-to-end. Our novel transformer-based prediction pipeline can focus on image regions relevant to the task, supports self-supervised regimes, and ensures that solutions are consistent with human anthropometry. We show state-of-the-art results on Human3.6M and 3DPW, for both the fully-supervised and the self-supervised models, on the task of inferring 3D human shape, joint positions, and global translation. Moreover, we observe very solid 3D reconstruction performance for difficult human poses collected in the wild. Models will be made available for research.
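A schematic PyTorch sketch of a marker-based pipeline in this spirit: a transformer refines image tokens, predicts 3D surface markers, and a small head maps the markers to body-model parameters. Layer sizes, the number of markers, and the pooling scheme are assumptions, not THUNDR's actual architecture.

```python
import torch
import torch.nn as nn

class MarkerRegressor(nn.Module):
    """Sketch of a marker-based reconstruction pipeline (sizes are assumptions)."""
    def __init__(self, num_tokens=196, d_model=256, num_markers=67, num_params=85):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.pool = nn.AdaptiveAvgPool1d(num_markers)    # one pooled feature per marker
        self.to_markers = nn.Linear(d_model, 3)          # 3D position per marker
        self.to_params = nn.Sequential(                  # markers -> pose/shape state
            nn.Flatten(), nn.Linear(num_markers * 3, 256), nn.ReLU(), nn.Linear(256, num_params)
        )

    def forward(self, tokens):                           # tokens: (B, num_tokens, d_model)
        feats = self.encoder(tokens)
        pooled = self.pool(feats.transpose(1, 2)).transpose(1, 2)  # (B, num_markers, d_model)
        markers = self.to_markers(pooled)                # (B, num_markers, 3)
        params = self.to_params(markers)                 # (B, num_params)
        return markers, params

model = MarkerRegressor()
markers, params = model(torch.randn(2, 196, 256))
print(markers.shape, params.shape)
```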
Neural Descent for Visual 3D Human Pose and Shape
Andrei Zanfir
Mihai Zanfir
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021), pp. 14484-14493
We present a deep neural network methodology to reconstruct the 3D pose and shape of people, given image or video inputs. We rely on a recently introduced, expressive full-body statistical 3D human model, GHUM, with facial expression and hand detail, and aim to learn to reconstruct the model's pose and shape states in a self-supervised regime. Central to our methodology is a learning-to-learn approach, referred to as HUman Neural Descent (HUND), that avoids both second-order differentiation when training the model parameters and expensive state gradient descent for accurately minimizing a semantic differentiable rendering loss at test time. Instead, we rely on novel recurrent stages to update the pose and shape parameters so that losses are minimized effectively and the process is regularized to ensure progress.
The newly introduced architecture is tested extensively and achieves state-of-the-art results on datasets such as H3.6M and 3DPW, as well as on complex imagery collected in the wild.
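A minimal sketch of the learned-update idea: a recurrent module proposes pose/shape updates from the current estimate and a feedback signal, in place of explicit gradient descent at test time. The module choice (a GRU cell) and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class NeuralDescent(nn.Module):
    """Sketch of a learning-to-learn update loop: a recurrent unit proposes
    pose/shape updates from the current estimate and a loss/feedback signal,
    instead of running gradient descent on the rendering loss at test time."""
    def __init__(self, state_dim=85, feedback_dim=32, hidden=256):
        super().__init__()
        self.cell = nn.GRUCell(state_dim + feedback_dim, hidden)
        self.delta = nn.Linear(hidden, state_dim)

    def forward(self, theta0, feedback_fn, steps=5):
        theta = theta0
        h = torch.zeros(theta.shape[0], self.cell.hidden_size)
        for _ in range(steps):
            fb = feedback_fn(theta)                      # e.g. features of the fitting error
            h = self.cell(torch.cat([theta, fb], dim=-1), h)
            theta = theta + self.delta(h)                # learned update, no explicit gradients
        return theta

hund = NeuralDescent()
feedback = lambda th: torch.zeros(th.shape[0], 32)       # stand-in feedback signal
theta = hund(torch.zeros(2, 85), feedback)
print(theta.shape)
```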
Explaining in Style: Training a GAN to explain a classifier in StyleSpace
Yossi Gandelsman
Yoav Itzhak Wald
Phillip Isola
Michal Irani
Proc. ICCV 2021
Image classification models can depend on multiple different semantic attributes of the image. An explanation of the classifier's decision needs to both discover and visualize these properties. Here we present StylEx, a method for doing this by training a generative model to specifically explain multiple attributes that underlie classifier decisions. A natural source for such attributes is the S-space of StyleGAN, which is known to generate semantically meaningful dimensions in the image. However, these will typically not correspond to classifier-specific attributes, since standard GAN training is not dependent on the classifier. To overcome this, we propose a training procedure for StyleGAN that incorporates the classifier model. This results in an S-space that captures distinct attributes underlying classifier outputs. After training, the model can be used to visualize the effect of changing multiple attributes per image, thus providing an image-specific explanation. We apply StylEx to multiple domains, including animals, leaves, faces, and retinal images. For these, we show how an image can be changed in different ways to change its classifier prediction.
Our results show that the method finds attributes that align well with semantic ones, generates meaningful image-specific explanations, and produces attributes that are interpretable, as measured in user studies.
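The attribute-discovery step can be illustrated by ranking style coordinates according to how much perturbing each one changes the classifier output; the stand-in generator and classifier below, and the perturbation-based ranking itself, are simplifying assumptions rather than the paper's exact procedure.

```python
import numpy as np

def rank_style_coordinates(generator, classifier, s, step=2.0, top_k=5):
    """Rank style-space coordinates by how much perturbing each one changes the
    classifier's probability for the generated image."""
    base_prob = classifier(generator(s))
    effects = []
    for i in range(len(s)):
        s_shifted = s.copy()
        s_shifted[i] += step
        effects.append(abs(classifier(generator(s_shifted)) - base_prob))
    return np.argsort(effects)[::-1][:top_k]             # most influential coordinates first

# Toy stand-ins: a "generator" that passes the style vector through, and a
# "classifier" that only looks at coordinate 7.
gen = lambda s: s
clf = lambda img: 1.0 / (1.0 + np.exp(-img[7]))
s = np.zeros(32)
print(rank_style_coordinates(gen, clf, s))               # coordinate 7 should rank first
```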
GHUM & GHUML: Generative 3D Human Shape and Articulated Pose Models
Hongyi Xu
Andrei Zanfir
IEEE/CVF Conference on Computer Vision and Pattern Recognition (Oral) (2020), pp. 6184-6193
We present a statistical, articulated 3D human shape modeling pipeline within a fully trainable, modular, deep learning framework. Given high-resolution complete 3D body scans of humans, captured in various poses, together with additional closeups of their head and facial expressions, as well as hand articulation, and given initial, artist-designed, gender-neutral rigged quad-meshes, we train all model parameters, including non-linear shape spaces based on variational auto-encoders, pose-space deformation correctives, skeleton joint center predictors, and blend skinning functions, in a single consistent learning loop. The models are simultaneously trained with all the 3D dynamic scan data (over 60,000 diverse human configurations in our new dataset) in order to capture correlations and ensure consistency of the various components. The models support facial expression analysis, as well as body (with detailed hand) shape and pose estimation. We provide fully trainable generic human models at different resolutions (the moderate-resolution GHUM with 10,168 vertices and the low-resolution GHUML(ite) with 3,194 vertices), run comparisons between them, analyze the impact of different components, and illustrate their reconstruction from image data. The models are available for research.
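The blend-skinning component mentioned above builds on the generic linear blend skinning formula, sketched below; this is the textbook LBS computation, not GHUM's learned variant.

```python
import numpy as np

def linear_blend_skinning(vertices, weights, rotations, translations):
    """Each rest-pose vertex is transformed by every joint and the results are
    blended by per-vertex skinning weights (generic LBS)."""
    # vertices: (V, 3), weights: (V, J), rotations: (J, 3, 3), translations: (J, 3)
    per_joint = np.einsum('jab,vb->jva', rotations, vertices) + translations[:, None, :]  # (J, V, 3)
    return np.einsum('vj,jva->va', weights, per_joint)                                    # (V, 3)

V, J = 5, 2
verts = np.random.default_rng(0).standard_normal((V, 3))
w = np.full((V, J), 0.5)                       # equal weight on both joints
R = np.stack([np.eye(3), np.eye(3)])
t = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(linear_blend_skinning(verts, w, R, t))   # vertices shifted halfway along +x
```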
Multi-Plane Program Induction with 3D Box Priors
Yikai Li
Jiayuan Mao
Xiuming Zhang
Josh Tenenbaum
Jiajun Wu
Neural Information Processing Systems (NeurIPS) (2020)
We consider two important structures in understanding and editing images: modeling regular, program-like texture or patterns in 2D planes, and the 3D posing of these planes in the scene. Unlike prior work on image-based program synthesis, which assumes the image contains a single visible 2D plane, we present Box Program Induction (BPI), which infers a program-like scene representation that simultaneously models repeated structure on multiple 2D planes, the 3D position and orientation of the planes, and camera parameters, all from a single image. Our model assumes a box prior, i.e., that the image captures either an inner view or an outer view of a box in 3D. It uses neural networks to infer visual cues such as vanishing points, wireframe lines, or plane segmentations to guide a search-based algorithm that finds the program that best explains the image. Such a holistic, structured scene representation enables 3D-aware interactive image editing operations such as inpainting missing pixels, changing camera parameters, and extrapolating the image contents.
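The search-based step can be illustrated with a toy version in which a "program" is just the repetition period of a pattern on a single rectified plane and candidates are scored by reconstruction error; the program space and scoring here are simplifications of what the paper describes.

```python
import numpy as np

def reconstruction_error(image, program):
    """Score a candidate program by regenerating the image it implies and comparing
    to the observed image. Here the 'program' is just (period_x, period_y) of a
    repeating pattern on one rectified plane."""
    px, py = program
    shifted = np.roll(np.roll(image, px, axis=1), py, axis=0)
    return float(np.mean((image - shifted) ** 2))   # a truly periodic texture matches itself

def search_program(image, max_period=16):
    candidates = [(px, py) for px in range(1, max_period) for py in range(1, max_period)]
    return min(candidates, key=lambda prog: reconstruction_error(image, prog))

# Toy usage: a texture that repeats every 8 pixels horizontally and 4 vertically.
ys, xs = np.mgrid[0:32, 0:32]
texture = ((xs % 8 == 0) | (ys % 4 == 0)).astype(float)
print(search_program(texture))                      # expected (8, 4)
```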