Noah Snavely
Authored Publications
We present a method for generating Streetscapes: long sequences of views through an on-the-fly synthesized, city-scale scene. Our generation is conditioned on language input (e.g., city name, weather), as well as an underlying map/layout hosting the desired trajectory. Compared to recent models for video generation or 3D view synthesis, our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency. To achieve this goal, we build on recent work on video diffusion, used within an autoregressive framework that can easily scale to long sequences. In particular, we introduce a new temporal imputation method that prevents our autoregressive approach from drifting from the distribution of realistic city imagery. We train our Streetscapes system on a compelling source of data: posed imagery from Google Street View, along with contextual map data, which allows users to generate city views conditioned on any desired city layout, with controllable camera poses.
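To make the temporal-imputation idea concrete, here is a minimal NumPy sketch, not the paper's system: denoise_step, the noise schedule, and all shapes are invented placeholders standing in for a trained video diffusion model. It only illustrates how already-generated overlap frames can be re-noised and written back into the sample at each denoising step, so an autoregressive rollout stays anchored to what came before.

import numpy as np

T_CHUNK, OVERLAP, H, W, C = 16, 4, 64, 64, 3
NUM_STEPS = 50
alphas = np.linspace(0.999, 0.90, NUM_STEPS)  # toy noise schedule

def denoise_step(x, t):
    # Placeholder for one reverse-diffusion step of a learned video model.
    return x * 0.99

def sample_chunk(prev_tail):
    x = np.random.randn(T_CHUNK, H, W, C)
    for t in range(NUM_STEPS):
        x = denoise_step(x, t)
        if prev_tail is not None:
            # Temporal imputation: overwrite the overlapping frames with a
            # noised copy of the already-generated frames at this noise level.
            noise = np.random.randn(OVERLAP, H, W, C)
            a = alphas[t]
            x[:OVERLAP] = np.sqrt(a) * prev_tail + np.sqrt(1 - a) * noise
    return x

video = []
tail = None
for _ in range(4):  # four chunks stitched into one long, consistent sequence
    chunk = sample_chunk(tail)
    video.append(chunk if tail is None else chunk[OVERLAP:])
    tail = chunk[-OVERLAP:]
video = np.concatenate(video, axis=0)
print(video.shape)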
We present an approach to modeling an image-space prior on scene motion. Our prior is learned from a collection of motion trajectories extracted from real video sequences depicting natural, oscillatory dynamics such as trees, flowers, candles, and clothes swaying in the wind. We model this dense, long-term motion prior in the Fourier domain: given a single image, our trained model uses a frequency-coordinated diffusion sampling process to predict a spectral volume, which can be converted into a motion texture that spans an entire video. Along with an image-based rendering module, these trajectories can be used for a number of downstream applications, such as turning still images into seamlessly looping videos, or allowing users to realistically interact with objects in real pictures by interpreting the spectral volumes as image-space modal bases, which approximate object dynamics.
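As a rough illustration of converting a spectral volume into a motion texture, the sketch below uses random Fourier coefficients in place of the model's prediction and a plain inverse FFT; the exact normalization and frequency handling in the paper may differ.

import numpy as np

H, W, K, T = 32, 32, 16, 60   # image size, number of frequencies, frames
# Complex Fourier coefficients of x/y displacement at K low frequencies,
# standing in for the diffusion model's predicted spectral volume.
spectral_volume = (np.random.randn(K, H, W, 2) +
                   1j * np.random.randn(K, H, W, 2)) * 0.1

def motion_texture(spec, num_frames):
    # Place the K predicted terms into a full spectrum and invert along time
    # to obtain per-frame 2D displacements of shape (num_frames, H, W, 2).
    full = np.zeros((num_frames, *spec.shape[1:]), dtype=complex)
    full[:spec.shape[0]] = spec
    return np.fft.ifft(full, axis=0).real * num_frames

traj = motion_texture(spectral_volume, T)
print(traj.shape)  # (60, 32, 32, 2): where each pixel moves in each frame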
Persistent Nature: A Generative Model of Unbounded 3D Worlds
Lucy Chai
Zhengqi Li
Phillip Isola
Computer Vision and Pattern Recognition (CVPR) (2023)
Despite increasingly realistic image quality, recent 3D image generative models often operate on bounded domains with limited camera motions. We investigate the task of unconditionally synthesizing unbounded nature scenes, enabling arbitrarily large camera motion while maintaining a persistent 3D world model. Our scene representation consists of an extendable, planar scene layout grid, which can be rendered from arbitrary camera poses via a 3D decoder and volume rendering, and a panoramic skydome. Based on this representation, we learn a generative world model solely from single-view internet photos. Our method enables simulating long flights through 3D landscapes, while maintaining global scene consistency; for instance, returning to the starting point yields the same view of the scene. Our approach enables scene extrapolation beyond the fixed bounds of current 3D generative models, while also supporting a persistent, camera-independent world representation that stands in contrast to auto-regressive 3D prediction models.
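The toy sketch below illustrates only the persistence property of an extendable layout grid: cell features are created once, cached, and re-used, so flying away and returning to the start yields the same content. The class, feature shapes, and seeding scheme are invented for illustration; the 3D decoder, volume rendering, and skydome are omitted.

import numpy as np

class PersistentLayout:
    def __init__(self, cell_dim=8):
        self.cells = {}          # (i, j) -> feature vector for that layout cell
        self.cell_dim = cell_dim

    def cell(self, i, j):
        if (i, j) not in self.cells:
            # Sample new layout features only the first time a cell is visited.
            rng = np.random.default_rng(hash((i, j)) & 0xFFFFFFFF)
            self.cells[(i, j)] = rng.standard_normal(self.cell_dim)
        return self.cells[(i, j)]

layout = PersistentLayout()
first = layout.cell(0, 0).copy()
for step in range(100):          # a long camera flight across the grid
    layout.cell(step, step // 2)
# Persistent world: returning to the start yields the same cell features.
assert np.allclose(first, layout.cell(0, 0))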
DynIBaR: Neural Dynamic Image-Based Rendering
Zhengqi Li
Qianqian Wang
Computer Vision and Pattern Recognition (CVPR) (2023)
We address the problem of synthesizing novel views from a monocular video depicting complex dynamic scenes. State-of-the-art methods based on temporally varying Neural Radiance Fields (dynamic NeRFs) have shown impressive results on this task. However, for long videos with complex object motions and uncontrolled camera trajectories, these methods can produce blurry or inaccurate renderings, hampering their use in real-world applications. Rather than encoding a dynamic scene within the weights of MLPs, we present a new method that addresses these limitations by adopting a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene-motion-aware manner. Our system retains the advantages of prior methods in modeling complex scenes and view-dependent effects, while enabling the synthesis of photo-realistic novel views from long videos featuring complex scene dynamics and unconstrained camera trajectories. We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets, and also apply our approach to in-the-wild videos with challenging camera and object motion, where prior methods fail to produce high-quality renderings.
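A toy sketch of motion-aware feature aggregation follows; the feature maps, intrinsics, and the motion_to_src scene-flow function are random or hand-written placeholders rather than DynIBaR's learned components. It shows only the pattern of displacing a 3D sample point by estimated scene motion before gathering features from nearby source views.

import numpy as np

N_SRC, H, W, F = 4, 32, 32, 16
feat = np.random.randn(N_SRC, H, W, F)             # per-source-view feature maps
K = np.array([[40.0, 0, W / 2], [0, 40.0, H / 2], [0, 0, 1]])

def motion_to_src(p, src_idx):
    # Placeholder scene flow: where point p (at the target time) sits at the
    # source frame's time; a learned trajectory field provides this in practice.
    return p + 0.01 * src_idx

def aggregate(p):
    gathered = []
    for s in range(N_SRC):
        q = motion_to_src(p, s)
        u, v, z = K @ q                            # project into the source view
        x, y = int(u / z), int(v / z)
        if 0 <= x < W and 0 <= y < H and z > 0:
            gathered.append(feat[s, y, x])
    return np.mean(gathered, axis=0) if gathered else np.zeros(F)

print(aggregate(np.array([0.1, -0.2, 2.0])).shape)  # (16,) aggregated feature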
Associating Objects and their Effects in Unconstrained Monocular Video
Erika Lu
Zhengqi Li
Leonid Sigal
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
We propose a method to decompose a video into a background and a set of foreground layers, where the background captures stationary elements while the foreground layers capture moving objects along with their associated effects (e.g., shadows and reflections). Our approach is designed for unconstrained monocular videos, with arbitrary camera and object motion. Prior work that tackles this problem assumes that the video can be mapped onto a fixed 2D canvas, severely limiting the possible space of camera motion. Instead, our method applies recent progress in monocular camera pose and depth estimation to create a full, RGBD video layer for the background, along with a video layer for each foreground object. To solve the under-constrained decomposition problem, we propose a new loss formulation based on multi-view consistency. We test our method on challenging videos with complex camera motion and show significant qualitative improvement over current methods.
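The sketch below shows, with made-up shapes and an identity warp standing in for pose/depth-based reprojection, how such a decomposition can be scored: each frame is reconstructed by alpha-compositing foreground layers over the background, and a consistency term asks the shared background layer to agree across frames.

import numpy as np

T, H, W, L = 5, 24, 24, 2                 # frames, size, foreground layers
frames = np.random.rand(T, H, W, 3)
bg = np.random.rand(H, W, 3)              # RGB of the shared background layer
fg_rgb = np.random.rand(T, L, H, W, 3)
fg_alpha = np.random.rand(T, L, H, W, 1)

def warp_bg_to_frame(bg, t):
    # Placeholder for warping the background into frame t using camera pose
    # and depth; here an identity warp keeps the sketch self-contained.
    return bg

def composite(t):
    out = warp_bg_to_frame(bg, t)
    for l in range(L):                     # back-to-front compositing
        a = fg_alpha[t, l]
        out = a * fg_rgb[t, l] + (1 - a) * out
    return out

recon_loss = np.mean([(composite(t) - frames[t]) ** 2 for t in range(T)])
consistency = np.mean([(warp_bg_to_frame(bg, t) - warp_bg_to_frame(bg, 0)) ** 2
                       for t in range(T)])
print(float(recon_loss), float(consistency))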
ASIC: Aligning Sparse in-the-wild Image Collections
Kamal Gupta
Varun Jampani
Abhinav Shrivastava
Abhishek Kar
International Conference on Computer Vision (ICCV) (2023)
We present a method for joint alignment of sparse in-the-wild image collections of an object category. Most prior works assume either ground-truth keypoint annotations or a large dataset of images of a single object category. However, neither of these assumptions holds for the long tail of objects present in the world. We present a self-supervised technique that directly optimizes on a sparse collection of images of a particular object or object category to obtain consistent dense correspondences across the collection. We use pairwise nearest neighbors obtained from deep features of a pre-trained vision transformer (ViT) model as noisy and sparse keypoint matches, and make them dense and accurate by optimizing a neural network that jointly maps the image collection into a learned canonical grid. Experiments on the CUB and SPair-71k benchmarks demonstrate that our method produces globally consistent and higher-quality correspondences across the image collection compared to existing self-supervised methods. Code and other material will be made available at https://kampta.github.io/asic.
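As a small illustration of the pseudo-label step, the sketch below computes mutual nearest neighbors between two sets of patch descriptors; the random features stand in for pre-trained ViT features, and the subsequent joint optimization into a canonical grid is not shown.

import numpy as np

P, D = 196, 64                              # patches per image, feature dimension
fa = np.random.randn(P, D)
fb = np.random.randn(P, D)
fa /= np.linalg.norm(fa, axis=1, keepdims=True)
fb /= np.linalg.norm(fb, axis=1, keepdims=True)

sim = fa @ fb.T                             # cosine similarity matrix
nn_ab = sim.argmax(axis=1)                  # best match in B for each patch of A
nn_ba = sim.argmax(axis=0)                  # best match in A for each patch of B
mutual = [(i, int(nn_ab[i])) for i in range(P) if nn_ba[nn_ab[i]] == i]
print(f"{len(mutual)} mutual nearest-neighbour matches (noisy, sparse keypoints)")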
3D Moments from Near Duplicate Photos
Qianqian Wang
Zhengqi Li
Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
We introduce a new computational photography effect, starting from a pair of near-duplicate photos of the kind that are prevalent in people's photostreams. Combining monocular depth synthesis and optical flow, we build a novel end-to-end system that can interpolate scene motion while simultaneously allowing independent control of the camera. We use our system to create short videos with scene motion and cinematic camera motion. We compare our method against two different baselines and demonstrate that our system outperforms them both qualitatively and quantitatively on publicly available benchmark datasets.
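A rough sketch of the underlying geometry, with random placeholder depth, flow, and intrinsics: pixels are advected part-way along the flow toward the second photo, lifted to 3D with monocular depth, and re-projected through a user-chosen camera, so scene time and camera pose are controlled independently. The full system involves learned rendering components not shown here.

import numpy as np

H, W = 48, 64
rgb = np.random.rand(H, W, 3)
depth = 1.0 + np.random.rand(H, W)          # monocular depth (placeholder)
flow = np.random.randn(H, W, 2) * 0.5       # optical flow to the second photo
K = np.array([[60.0, 0, W / 2], [0, 60.0, H / 2], [0, 0, 1]])
t = 0.5                                     # interpolation time in [0, 1]

ys, xs = np.mgrid[0:H, 0:W]
uv1 = np.stack([xs + t * flow[..., 0], ys + t * flow[..., 1]], -1)  # moved pixels
pts = np.linalg.inv(K) @ np.concatenate(
    [uv1.reshape(-1, 2), np.ones((H * W, 1))], 1).T * depth.reshape(-1)  # lift to 3D

cam_shift = np.array([[0.1 * t], [0.0], [0.0]])   # user-chosen camera translation
proj = K @ (pts - cam_shift)                      # re-project through the new camera
uv2 = (proj[:2] / proj[2]).T
print(uv2.shape)  # (H*W, 2) splat targets for the novel view at time t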
Deformable Sprites for Unsupervised Video Decomposition
Vickie Ye
Zhengqi Li
Angjoo Kanazawa
Computer Vision and Pattern Recognition (CVPR) (2022)
We describe a method to extract persistent elements of a dynamic scene from an input video. We represent each scene element as a Deformable Sprite consisting of three components: 1) a 2D texture image for the entire video, 2) per-frame masks for the element, and 3) non-rigid deformations that map the texture image into each video frame. The resulting decomposition allows for applications such as consistent video editing. Deformable Sprites are a type of video auto-encoder model that is optimized on individual videos, and does not require training on a large dataset, nor does it rely on pre-trained models. Moreover, our method does not require object masks or other user input, and discovers moving objects of a wider variety than previous work. We evaluate our approach on standard video datasets and show qualitative results on a diverse array of Internet videos.
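The toy example below illustrates only the sprite parameterization: a single texture for the whole video, a hand-written sinusoidal warp standing in for the learned non-rigid deformation, and a circular mask standing in for the learned per-frame mask. All values are synthetic; the real method optimizes these quantities per video.

import numpy as np

H, W = 32, 32
texture = np.random.rand(H, W, 3)                    # one texture image per element
ys, xs = np.mgrid[0:H, 0:W].astype(float)

def render_frame(t):
    # Non-rigid deformation: a smooth sinusoidal warp mapping frame coordinates
    # into texture coordinates (a learned field in the actual method).
    u = np.clip(xs + 2 * np.sin(ys / 6 + t), 0, W - 1).astype(int)
    v = np.clip(ys + 2 * np.cos(xs / 6 + t), 0, H - 1).astype(int)
    warped = texture[v, u]
    # Per-frame visibility mask for this element (here a growing disk).
    mask = ((xs - W / 2) ** 2 + (ys - H / 2) ** 2 < (10 + t) ** 2)[..., None]
    return warped * mask                              # element's contribution to frame t

print(render_frame(0).shape, render_frame(3).mean())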
Dimensions of Motion: Monocular Prediction through Flow Subspaces
Richard Strong Bowen*
Ramin Zabih
Proceedings of the International Conference on 3D Vision (3DV) (2022)
We introduce a way to learn to estimate a scene representation from a single image by predicting a low-dimensional subspace of optical flow for each training example, which encompasses the variety of possible camera and object movement. Supervision is provided by a novel loss which measures the distance between this predicted flow subspace and an observed optical flow. This provides a new approach to learning scene representation tasks, such as monocular depth prediction or instance segmentation, in an unsupervised fashion using in-the-wild input videos without requiring camera poses, intrinsics, or an explicit multi-view stereo step. We evaluate our method in multiple settings, including an indoor depth prediction task where it achieves comparable performance to recent methods trained with more supervision.
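A minimal sketch of the subspace loss, with random arrays in place of the network's predicted flow bases and the observed flow: the observed flow is projected onto the span of the predicted bases by least squares, and the residual is penalized, so the loss is small exactly when the flow lies in the predicted subspace.

import numpy as np

H, W, K = 32, 32, 6
basis = np.random.randn(K, H, W, 2)          # predicted flow basis fields
observed = np.random.randn(H, W, 2)          # optical flow from an unposed video pair

B = basis.reshape(K, -1).T                   # (H*W*2, K) basis matrix
f = observed.reshape(-1)                     # (H*W*2,) flattened observed flow
coeffs, *_ = np.linalg.lstsq(B, f, rcond=None)
residual = f - B @ coeffs
subspace_loss = np.mean(residual ** 2)       # distance from flow to the subspace
print(subspace_loss)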
InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images
Zhengqi Li
Qianqian Wang
Angjoo Kanazawa
European Conference on Computer Vision (ECCV) (2022)
We present a method for learning to generate unbounded flythrough videos of natural scenes starting from a single view, where this capability is learned from a collection of single photographs, without requiring camera poses or even multiple views of each scene. To achieve this, we propose a novel self-supervised view generation training paradigm in which we sample and render virtual camera trajectories, including cyclic ones, allowing our model to learn stable view generation from a collection of single views. At test time, despite never seeing a video during training, our approach can take a single image and generate long camera trajectories comprising hundreds of new views with realistic and diverse content. We compare our approach with recent state-of-the-art supervised view generation methods that require posed multi-view videos, and demonstrate superior performance and synthesis quality.
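The sketch below conveys only the cycle-consistency signal behind this self-supervision: a virtual camera path that wanders away from the input view and returns to it is rolled out with a placeholder render-and-refine step, and the frame produced back at the start is compared to the input photo. The pose parameterization and all functions are invented stand-ins, not the paper's networks.

import numpy as np

H, W = 32, 32
image = np.random.rand(H, W, 3)

def sample_cyclic_poses(n=8):
    # Pose offsets that leave the start view and come back to it (a closed loop).
    return np.concatenate([np.linspace(0, 0.3, n // 2),
                           np.linspace(0.3, 0.0, n // 2)])

def render_step(frame, pose_delta):
    # Placeholder render-and-refine step; a real model would warp with predicted
    # depth and inpaint the disoccluded regions.
    return np.clip(frame + 0.01 * pose_delta, 0, 1)

frame = image
for pose in sample_cyclic_poses():
    frame = render_step(frame, pose)

cycle_loss = np.mean((frame - image) ** 2)   # should be small after training
print(cycle_loss)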