Forrester Cole

Forrester is a software engineer working on computer vision and graphics research, particularly 3D understanding of images and videos. Prior to Google, Forrester was a postdoctoral researcher at Pixar Animation Studios and MIT. He completed his PhD at Princeton University under Adam Finkelstein.

Authored Publications
    Associating Objects and their Effects in Unconstrained Monocular Video
    Erika Lu
    Zhengqi Li
    Leonid Sigal
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2023
    We propose a method to decompose a video into a background and a set of foreground layers, where the background captures stationary elements while the foreground layers capture moving objects along with their associated effects (e.g. shadows and reflections). Our approach is designed for unconstrained monocular videos, with arbitrary camera and object motion. Prior work that tackles this problem assumes that the video can be mapped onto a fixed 2D canvas, severely limiting the possible space of camera motion. Instead, our method applies recent progress in monocular camera pose and depth estimation to create a full, RGBD video layer for the background, along with a video layer for each foreground object. To solve the underconstrained decomposition problem, we propose a new loss formulation based on multi-view consistency. We test our method on challenging videos with complex camera motion and show significant qualitative improvement over current methods.
    SCOOP: Self-Supervised Correspondence and Optimization-Based Scene Flow
    Itai Lang
    Shai Avidan
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2023
    Scene flow estimation is a long-standing problem in computer vision, where the goal is to find the scene's 3D motion from its consecutive observations. Recently, there has been a research effort to compute scene flow using 3D point clouds. A main approach is to train a regression model that consumes source and target point clouds and outputs the per-point translation vector. An alternative approach is to learn point correspondence between the point clouds, concurrently with a refinement regression of the initial flow. In both approaches the task is very challenging, since the flow is regressed in the free 3D space, and a typical solution is to resort to a large annotated synthetic dataset. We introduce CorrFlow, a new method for scene flow estimation that can be learned on a small amount of data without using ground-truth flow supervision. In contrast to previous works, we train a pure correspondence model that is focused on learning point feature representation, and initialize the flow as the difference between a source point and its softly corresponding target point. Then, at test time, we directly optimize a flow refinement component with a self-supervised objective, which leads to a coherent flow field between the point clouds. Experiments on widely used datasets demonstrate the performance gains achieved by our method compared to existing leading techniques.
    DynIBaR: Neural Dynamic Image-Based Rendering
    Zhengqi Li
    Qianqian Wang
    Computer Vision and Pattern Recognition (CVPR) (2023)
    We address the problem of synthesizing novel views from a monocular video depicting complex dynamic scenes. State-of-the-art methods based on temporally varying Neural Radiance Fields (aka dynamic NeRFs) have shown impressive results on this task. However, for long videos with complex object motions and uncontrolled camera trajectories, these methods can produce blurry or inaccurate renderings, hampering their use in real-world applications. Rather than encoding a dynamic scene within the weights of MLPs, we present a new method that addresses these limitations by adopting a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene-motion-aware manner. Our system preserves the advantages for modeling complex scenes and view-dependent effects, but enables synthesizing photo-realistic novel views from long videos featuring complex scene dynamics with unconstrained camera trajectories. We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets, and also apply our approach to in-the-wild videos with challenging camera and object motion, where prior methods fail to produce high-quality renderings.
    The goal of this project is to learn a 3D shape representation that enables accurate surface reconstruction, compact storage, efficient computation, consistency for similar shapes, generalization across diverse shape categories, and inference from depth camera observations. Towards this end, we introduce Local Deep Implicit Functions (LDIF), a 3D shape representation that decomposes space into a structured set of learned implicit functions. We provide networks that infer the space decomposition and local deep implicit functions from a 3D mesh or posed depth image. During experiments, we find that it provides 10.3 points higher surface reconstruction accuracy (F-Score) than the state-of-the-art (OccNet), while requiring fewer than 1 percent of the network parameters. Experiments on posed depth image completion and generalization to unseen classes show 15.8 and 17.8 point improvements over the state-of-the-art, while producing a structured 3D representation for each input with consistency across diverse shape collections.
    We present a method for retiming people in an ordinary, natural video: manipulating and editing the time in which different motions of individuals in the video occur. We can temporally align different motions, change the speed of certain actions (speeding up/slowing down, or entirely "freezing" people), or "erase" selected people from the video altogether. We achieve these effects computationally via a dedicated learning-based layered video representation, where each frame in the video is decomposed into separate RGBA layers, representing the appearance of different people in the video. A key property of our model is that it not only disentangles the direct motions of each person in the input video, but also correlates each person automatically with the scene changes they generate, e.g., shadows, reflections, and motion of loose clothing. The layers can be individually retimed and recombined into a new video, allowing us to achieve realistic, high-quality renderings of retiming effects for real-world videos depicting complex actions and involving multiple individuals, including dancing, trampoline jumping, or group running.
    Learning the Depths of Moving People by Watching Frozen People
    Zhengqi Li
    Ce Liu
    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
    We present a method for predicting dense depth in scenarios where both a monocular camera and people in the scene are freely moving. Existing methods for recovering depth for dynamic, non-rigid objects from monocular video impose strong assumptions on the objects' motion and often can recover only a sparse depth. In this paper, we take a data-driven approach and learn human depth priors from a large corpus of data. Specifically, we use a new source of data comprised of thousands of Internet videos in which people imitate mannequins, i.e., people freeze in diverse, natural poses, while a hand-held camera is touring the scene. We then create training data using modern Multi-View Stereo (MVS) methods, and design a model that is applied to dynamic scenes at inference time. Our method makes use of motion parallax beyond single view and shows clear advantages over state-of-the-art monocular depth prediction methods. We demonstrate the applicability of our method on real-world sequences captured by a moving hand-held camera, depicting complex human actions. We show various 3D effects such as re-focusing, creating a stereoscopic video from a monocular one, and inserting virtual objects to the scene, all produced using our predicted depth maps.
    Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning. To address this challenge, we adopt a keypoint-based image representation and learn a stochastic dynamics model of the keypoints. Future frames are reconstructed from the keypoints and a reference frame. By modeling dynamics in the keypoint coordinate space, we achieve stable learning and avoid compounding of errors in pixel space. Our method improves upon unstructured representations both for pixel-level video prediction and for downstream tasks requiring object-level understanding of motion dynamics. We evaluate our model on diverse datasets: a multi-agent sports dataset, the Human3.6M dataset, and datasets based on continuous control tasks from the DeepMind Control Suite. The spatially structured representation outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition and reward prediction.
    Unsupervised Training for 3D Morphable Model Regression
    Kyle Genova
    Aaron Maschinot
    Daniel Vlasic
    The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
    We present a method for training a regression network from image pixels to 3D morphable model coordinates using only unlabeled photographs. The training loss is based on features from a facial recognition network, computed on-the-fly by rendering the predicted faces with a differentiable renderer. To make training from features feasible and avoid network fooling effects, we introduce three objectives: a batch regularization loss that encourages the output distribution to match the distribution of the morphable model, a loopback loss that ensures the regression network can correctly reinterpret its own output, and a multi-view loss that compares the predicted 3D face to the input photograph from multiple viewing angles. We train a regression network using these objectives, a set of unlabeled photographs, and the morphable model itself, and demonstrate state-of-the-art results.
    Style transfer usually refers to the task of applying color and texture information from a specific style image to a given content image while preserving the structure of the latter. Here we tackle the more generic problem of semantic style transfer: given two unpaired collections of images, we aim to learn a mapping between the corpus-level style of each collection, while preserving semantic content shared across the two domains. We introduce XGAN ("Cross-GAN"), a dual adversarial autoencoder, which captures a shared representation of the common domain semantic content in an unsupervised way, while jointly learning the domain-to-domain image translations in both directions. We exploit ideas from the domain adaptation literature and define a semantic consistency loss which encourages the model to preserve semantics in the learned embedding space. We report promising qualitative results for the task of face-to-cartoon translation. The cartoon dataset we collected for this purpose is in the process of being released as a new benchmark for semantic style transfer.
    We present a method for synthesizing a frontal, neutral-expression image of a person's face given an input face photograph. This is achieved by learning to generate facial landmarks and textures from features extracted from a facial-recognition network. Unlike previous approaches, our encoding feature vector is largely invariant to lighting, pose, and facial expression. Exploiting this invariance, we train our decoder network using only frontal, neutral-expression photographs. Since these photographs are well aligned, we can decompose them into a sparse set of landmark points and aligned texture maps. The decoder then predicts landmarks and textures independently and combines them using a differentiable image warping operation. The resulting images can be used for a number of applications, such as analyzing facial attributes, exposure and white balance adjustment, or creating a 3-D avatar.