Noah Snavely

Authored Publications
    Persistent Nature: A Generative Model of Unbounded 3D Worlds
    Lucy Chai
    Zhengqi Li
    Phillip Isola
    Computer Vision and Pattern Recognition (CVPR), 2023
    Despite increasingly realistic image quality, recent 3D image generative models often operate on bounded domains with limited camera motions. We investigate the task of unconditionally synthesizing unbounded nature scenes, enabling arbitrarily large camera motion while maintaining a persistent 3D world model. Our scene representation consists of an extendable, planar scene layout grid, which can be rendered from arbitrary camera poses via a 3D decoder and volume rendering, and a panoramic skydome. Based on this representation, we learn a generative world model solely from single-view internet photos. Our method enables simulating long flights through 3D landscapes, while maintaining global scene consistency: for instance, returning to the starting point yields the same view of the scene. Our approach enables scene extrapolation beyond the fixed bounds of current 3D generative models, while also supporting a persistent, camera-independent world representation that stands in contrast to auto-regressive 3D prediction models.
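The following is a minimal, hypothetical sketch of the rendering path this abstract describes: features are sampled from a planar layout grid at each ray sample's ground-plane location, decoded to density and color, and alpha-composited by standard volume rendering. The class name, shapes, and the tiny decoder are illustrative assumptions; the skydome and the generative training are omitted.

```python
# Hypothetical sketch of rendering from an extendable planar layout grid via a
# 3D decoder and volume rendering. Not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutRenderer(nn.Module):
    def __init__(self, feat_dim=32, grid_size=64):
        super().__init__()
        # Planar layout grid: a 2D feature map over the ground plane (x, z).
        self.layout = nn.Parameter(torch.randn(1, feat_dim, grid_size, grid_size))
        # 3D decoder: maps (layout feature, height y) -> (density, RGB).
        self.decoder = nn.Sequential(nn.Linear(feat_dim + 1, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, rays_o, rays_d, n_samples=32, near=0.1, far=10.0):
        # Sample points along each ray.
        t = torch.linspace(near, far, n_samples)                               # (S,)
        pts = rays_o[:, None, :] + rays_d[:, None, :] * t[None, :, None]       # (R, S, 3)
        # Look up layout features at the (x, z) ground-plane location of each point.
        xz = pts[..., [0, 2]] / 10.0                                           # normalize roughly to [-1, 1]
        feats = F.grid_sample(self.layout, xz[None], align_corners=True)       # (1, C, R, S)
        feats = feats[0].permute(1, 2, 0)                                      # (R, S, C)
        # Decode density and color, conditioning on the height above the plane.
        out = self.decoder(torch.cat([feats, pts[..., 1:2]], dim=-1))
        sigma, rgb = F.relu(out[..., 0]), torch.sigmoid(out[..., 1:])
        # Standard volume rendering: alpha-composite along the ray.
        delta = (far - near) / n_samples
        alpha = 1.0 - torch.exp(-sigma * delta)                                # (R, S)
        trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                         1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
        weights = alpha * trans
        return (weights[..., None] * rgb).sum(dim=1)                           # (R, 3)

rays_o = torch.zeros(4, 3)
rays_d = F.normalize(torch.randn(4, 3), dim=-1)
print(LayoutRenderer()(rays_o, rays_d).shape)  # torch.Size([4, 3])
```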
    ASIC: Aligning Sparse in-the-wild Image Collections
    Kamal Gupta
    Varun Jampani
    Abhinav Shrivastava
    International Conference on Computer Vision (ICCV), 2023
    We present a method for joint alignment of sparse in-the-wild image collections of an object category. Most prior works assume either ground-truth keypoint annotations or a large dataset of images of a single object category. However, neither of these assumptions holds true for the long tail of the objects present in the world. We present a self-supervised technique that directly optimizes on a sparse collection of images of a particular object/object category to obtain consistent dense correspondences across the collection. We use pairwise nearest neighbors obtained from deep features of a pre-trained vision transformer (ViT) model as noisy and sparse keypoint matches and make them dense and accurate by optimizing a neural network that jointly maps the image collection into a learned canonical grid. Experiments on CUB and SPair-71k benchmarks demonstrate that our method can produce globally consistent and higher quality correspondences across the image collection when compared to existing self-supervised methods. Code and other material will be made available at https://kampta.github.io/asic.
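As a rough illustration of the matching step mentioned above, the sketch below computes mutual nearest neighbors between ViT patch descriptors of two images and keeps the consistent pairs as noisy, sparse matches. The function name and shapes are assumptions, and real descriptors would come from a pre-trained ViT (e.g. DINO) rather than random tensors.

```python
# Illustrative sketch (not the authors' code): mutual nearest neighbors between
# ViT patch features of two images serve as noisy, sparse keypoint matches.
import torch
import torch.nn.functional as F

def mutual_nearest_neighbors(feat_a, feat_b):
    """feat_a: (Na, D), feat_b: (Nb, D) patch descriptors. Returns index pairs."""
    sim = F.normalize(feat_a, dim=-1) @ F.normalize(feat_b, dim=-1).T  # cosine similarity
    nn_ab = sim.argmax(dim=1)          # best match in B for each patch of A
    nn_ba = sim.argmax(dim=0)          # best match in A for each patch of B
    idx_a = torch.arange(feat_a.shape[0])
    mutual = nn_ba[nn_ab] == idx_a     # keep only mutually consistent pairs
    return torch.stack([idx_a[mutual], nn_ab[mutual]], dim=1)  # (M, 2)

# Random stand-ins for ViT patch descriptors of two images.
feat_a, feat_b = torch.randn(196, 384), torch.randn(196, 384)
matches = mutual_nearest_neighbors(feat_a, feat_b)
print(matches.shape)  # (M, 2) patch-index pairs, used as noisy supervision
```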
    Associating Objects and their Effects in Unconstrained Monocular Video
    Erika Lu
    Zhengqi Li
    Leonid Sigal
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
    We propose a method to decompose a video into a background and a set of foreground layers, where the background captures stationary elements while the foreground layers capture moving objects along with their associated effects (e.g. shadows and reflections). Our approach is designed for unconstrained monocular videos, with arbitrary camera and object motion. Prior work that tackles this problem assumes that the video can be mapped onto a fixed 2D canvas, severely limiting the possible space of camera motion. Instead, our method applies recent progress in monocular camera pose and depth estimation to create a full, RGBD video layer for the background, along with a video layer for each foreground object. To solve the underconstrained decomposition problem, we propose a new loss formulation based on multi-view consistency. We test our method on challenging videos with complex camera motion and show significant qualitative improvement over current methods.
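A minimal sketch of the kind of multi-view consistency term the abstract refers to is shown below, under assumed camera conventions: background pixels of one frame are unprojected with the background layer's depth, reprojected into another frame, and penalized for color disagreement. All names, shapes, and the specific photometric penalty are illustrative, not the paper's implementation.

```python
# A minimal sketch of a multi-view photometric consistency term for a
# background layer; conventions and names are assumptions.
import torch
import torch.nn.functional as F

def reproject_consistency(rgb_i, rgb_j, depth_i, K, T_ji, bg_mask_i):
    """rgb_*: (3,H,W), depth_i: (H,W), K: (3,3) intrinsics,
    T_ji: (4,4) pose mapping frame-i camera coords to frame-j, bg_mask_i: (H,W)."""
    H, W = depth_i.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()            # (3,H,W)
    # Unproject to 3D in frame i, transform to frame j, project back.
    cam_i = torch.linalg.inv(K) @ pix.reshape(3, -1) * depth_i.reshape(1, -1)
    cam_j = T_ji[:3, :3] @ cam_i + T_ji[:3, 3:4]
    proj = K @ cam_j
    uv = proj[:2] / proj[2:].clamp(min=1e-6)                                   # (2, H*W)
    # Sample frame j at the reprojected locations.
    grid = torch.stack([uv[0] / (W - 1) * 2 - 1, uv[1] / (H - 1) * 2 - 1], dim=-1)
    warped = F.grid_sample(rgb_j[None], grid.reshape(1, H, W, 2), align_corners=True)[0]
    # Penalize color differences only on background pixels.
    return ((warped - rgb_i).abs().mean(dim=0) * bg_mask_i).mean()

# Toy usage with identity pose/intrinsics and unit depth.
loss = reproject_consistency(torch.rand(3, 32, 32), torch.rand(3, 32, 32),
                             torch.ones(32, 32), torch.eye(3), torch.eye(4),
                             torch.ones(32, 32))
print(float(loss))
```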
    DynIBaR: Neural Dynamic Image-Based Rendering
    Zhengqi Li
    Qianqian Wang
    Computer Vision and Pattern Recognition (CVPR), 2023
    We address the problem of synthesizing novel views from a monocular video depicting complex dynamic scenes. State-of-the-art methods based on temporally varying Neural Radiance Fields (aka dynamic NeRFs) have shown impressive results on this task. However, for long videos with complex object motions and uncontrolled camera trajectories, these methods can produce blurry or inaccurate renderings, hampering their use in real-world applications. Rather than encoding a dynamic scene within the weights of MLPs, we present a new method that addresses these limitations by adopting a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene-motion-aware manner. Our system preserves the ability to model complex scenes and view-dependent effects, while enabling the synthesis of photo-realistic novel views from long videos featuring complex scene dynamics with unconstrained camera trajectories. We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets, and also apply our approach to in-the-wild videos with challenging camera and object motion, where prior methods fail to produce high-quality renderings.
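The sketch below illustrates, in simplified and hypothetical form, what scene-motion-aware aggregation could look like: a 3D sample at the target time is displaced by a motion field to each source view's time, projected into that view, and the sampled features are pooled. The plain mean at the end stands in for the learned aggregation, and every name and shape here is an assumption.

```python
# Simplified, hypothetical sketch of scene-motion-aware feature aggregation.
import torch
import torch.nn.functional as F

def aggregate_features(pts, t_target, src_feats, src_times, src_K, src_T, motion_field):
    """pts: (N,3) points at time t_target; src_feats: (V,C,H,W); src_times: (V,);
    src_K: (V,3,3); src_T: (V,4,4) world-to-camera; motion_field(pts, t0, t1) -> (N,3)."""
    V, C, H, W = src_feats.shape
    gathered = []
    for v in range(V):
        # Scene-motion-aware step: move each point from the target time to the source time.
        pts_v = pts + motion_field(pts, t_target, src_times[v])
        cam = (src_T[v, :3, :3] @ pts_v.T + src_T[v, :3, 3:]).T                 # (N,3)
        uv = (src_K[v] @ cam.T).T
        uv = uv[:, :2] / uv[:, 2:].clamp(min=1e-6)
        grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                            uv[:, 1] / (H - 1) * 2 - 1], dim=-1)                # (N,2)
        feat = F.grid_sample(src_feats[v:v+1], grid[None, :, None, :],
                             align_corners=True)[0, :, :, 0].T                  # (N,C)
        gathered.append(feat)
    return torch.stack(gathered).mean(dim=0)   # (N,C); the real system learns this pooling

# Toy usage with a zero motion field (static scene) and random inputs.
motion = lambda p, t0, t1: torch.zeros_like(p)
feats = aggregate_features(torch.randn(5, 3), 0.0, torch.randn(2, 16, 32, 32),
                           torch.tensor([0.0, 1.0]),
                           torch.eye(3).repeat(2, 1, 1), torch.eye(4).repeat(2, 1, 1), motion)
print(feats.shape)  # torch.Size([5, 16])
```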
    3D Moments from Near Duplicate Photos
    Qianqian Wang
    Zhengqi Li
    Conference on Computer Vision and Pattern Recognition (CVPR), 2022
    We introduce a new computational photography effect, starting from a pair of near-duplicate photos of the kind that are prevalent in people's photostreams. Combining monocular depth synthesis and optical flow, we build a novel end-to-end system that can interpolate scene motion while simultaneously allowing independent control of the camera. We use our system to create short videos with scene motion and cinematic camera motion. We compare our method against two different baselines and demonstrate that our system outperforms them both qualitatively and quantitatively on publicly available benchmark datasets.
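As a toy illustration of one ingredient named above (optical flow between the near-duplicate pair), the snippet below warps the first photo toward an intermediate time by scaling the flow field. The depth-based 3D lifting that enables independent camera control is omitted, and the function is a crude stand-in rather than the paper's interpolation module.

```python
# Toy sketch: sample the first frame at positions displaced by a fraction t of
# the optical flow, as a crude approximation of motion interpolation.
import torch
import torch.nn.functional as F

def warp_to_time(frame0, flow_0to1, t):
    """frame0: (1,3,H,W), flow_0to1: (1,2,H,W) in pixels, t in [0,1]."""
    _, _, H, W = frame0.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float()[None]              # (1,2,H,W) pixel grid
    coords = base + t * flow_0to1                                  # displaced sample positions
    grid = torch.stack([coords[:, 0] / (W - 1) * 2 - 1,
                        coords[:, 1] / (H - 1) * 2 - 1], dim=-1)   # (1,H,W,2) in [-1,1]
    return F.grid_sample(frame0, grid, align_corners=True)

frame0 = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)   # zero flow: the "interpolated" frame equals frame0
print(warp_to_time(frame0, flow, 0.5).shape)  # torch.Size([1, 3, 64, 64])
```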
    Deformable Sprites for Unsupervised Video Decomposition
    Vickie Ye
    Zhengqi Li
    Angjoo Kanazawa
    Computer Vision and Pattern Recognition (CVPR), 2022
    We describe a method to extract persistent elements of a dynamic scene from an input video. We represent each scene element as a Deformable Sprite consisting of three components: 1) a 2D texture image for the entire video, 2) per-frame masks for the element, and 3) non-rigid deformations that map the texture image into each video frame. The resulting decomposition allows for applications such as consistent video editing. Deformable Sprites are a type of video auto-encoder model that is optimized on individual videos; the method does not require training on a large dataset, nor does it rely on pre-trained models. Moreover, our method does not require object masks or other user input, and discovers moving objects of a wider variety than previous work. We evaluate our approach on standard video datasets and show qualitative results on a diverse array of Internet videos.
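A minimal sketch of how a single sprite, as defined by the three components above, could reconstruct one frame is given below: the shared texture image is sampled through a per-frame non-rigid deformation field and composited over a background with the per-frame mask. Shapes and names are invented for illustration.

```python
# Minimal sketch: reconstruct one frame from (texture, per-frame deformation,
# per-frame mask). Shapes are illustrative, not the paper's.
import torch
import torch.nn.functional as F

def render_sprite_frame(texture, deform_uv, mask, background):
    """texture: (1,3,Ht,Wt) shared over the video; deform_uv: (1,H,W,2) sampling
    coords in [-1,1] for this frame; mask: (1,1,H,W); background: (1,3,H,W)."""
    # Non-rigid warp: each output pixel looks up a location in the texture image.
    sprite = F.grid_sample(texture, deform_uv, align_corners=True)   # (1,3,H,W)
    # Composite the warped sprite over the background with the per-frame mask.
    return mask * sprite + (1 - mask) * background

texture = torch.rand(1, 3, 128, 128)
deform = torch.rand(1, 64, 64, 2) * 2 - 1     # random deformation field in [-1,1]
mask = torch.rand(1, 1, 64, 64)
background = torch.zeros(1, 3, 64, 64)
print(render_sprite_frame(texture, deform, mask, background).shape)  # (1, 3, 64, 64)
```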
    InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images
    Zhengqi Li
    Qianqian Wang
    Angjoo Kanazawa
    European Conference on Computer Vision (ECCV), 2022
    We present a method for learning to generate unbounded flythrough videos of natural scenes starting from a single view, where this capability is learned from a collection of single photographs, without requiring camera poses or even multiple views of each scene. To achieve this, we propose a novel self-supervised view generation training paradigm, where we sample and render virtual camera trajectories, including cyclic ones, allowing our model to learn stable view generation from a collection of single views. At test time, despite never seeing a video during training, our approach can take a single image and generate long camera trajectories comprising hundreds of new views with realistic and diverse content. We compare our approach with recent state-of-the-art supervised view generation methods that require posed multi-view videos and demonstrate superior performance and synthesis quality.
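The cyclic-trajectory idea can be illustrated with the hypothetical sketch below: a virtual camera path that loops back to its starting pose lets the view rendered at the end of the loop be compared directly against the input photograph, giving a self-supervised reconstruction target. The pose parameterization is an assumption, not the paper's sampling scheme.

```python
# Hedged sketch of sampling a cyclic virtual camera trajectory; the pose
# parameterization is illustrative.
import numpy as np

def cyclic_trajectory(n_steps=8, radius=1.0):
    """Returns a list of 4x4 camera-to-world matrices forming a closed loop."""
    poses = []
    for k in range(n_steps + 1):
        theta = 2 * np.pi * k / n_steps            # loop parameter; 0 and 2*pi coincide
        pose = np.eye(4)
        pose[0, 3] = radius * np.sin(theta)        # lateral displacement
        pose[2, 3] = radius * (1 - np.cos(theta))  # forward displacement, returns to 0
        poses.append(pose)
    return poses

poses = cyclic_trajectory()
# The last pose matches the first, so a view rendered at poses[-1] can be
# supervised directly against the input image, without any real video data.
print(np.allclose(poses[0], poses[-1]))  # True
```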
    Dimensions of Motion: Monocular Prediction through Flow Subspaces
    Richard Strong Bowen*
    Ramin Zabih
    Proceedings of the International Conference on 3D Vision (3DV), 2022
    We introduce a way to learn to estimate a scene representation from a single image by predicting a low-dimensional subspace of optical flow for each training example, which encompasses the variety of possible camera and object movement. Supervision is provided by a novel loss which measures the distance between this predicted flow subspace and an observed optical flow. This provides a new approach to learning scene representation tasks, such as monocular depth prediction or instance segmentation, in an unsupervised fashion using in-the-wild input videos without requiring camera poses, intrinsics, or an explicit multi-view stereo step. We evaluate our method in multiple settings, including an indoor depth prediction task where it achieves comparable performance to recent methods trained with more supervision.
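Under the assumption that the network predicts a small basis of flow fields per image, the loss described above can be sketched as the residual of projecting the observed flow onto the span of that basis. The least-squares projection via normal equations below is one simple realization; names and shapes are illustrative.

```python
# Sketch of a flow-subspace loss: distance between the observed flow and the
# subspace spanned by a predicted flow basis. Assumed shapes, not the paper's code.
import torch

def flow_subspace_loss(basis, observed_flow):
    """basis: (K,2,H,W) predicted flow basis fields; observed_flow: (2,H,W)."""
    K = basis.shape[0]
    B = basis.reshape(K, -1).T            # (2*H*W, K) basis fields as column vectors
    f = observed_flow.reshape(-1, 1)      # (2*H*W, 1) observed flow as a vector
    # Least-squares coefficients of the observed flow in the predicted basis.
    coeffs = torch.linalg.solve(B.T @ B, B.T @ f)
    residual = f - B @ coeffs             # component orthogonal to the subspace
    return (residual ** 2).mean()

basis = torch.randn(6, 2, 32, 32, requires_grad=True)
flow = torch.randn(2, 32, 32)
loss = flow_subspace_loss(basis, flow)
loss.backward()  # gradients flow to the predicted basis, hence to the network
print(float(loss))
```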
    IBRNet: Learning Multi-View Image-Based Rendering
    Kyle Genova
    Pratul Srinivasan
    Qianqian Wang
    Ricardo Martin-Brualla
    Zhicheng Wang
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021
    We present a method that synthesizes novel views of complex scenes by interpolating a sparse set of nearby views. The core of our method is a multilayer perceptron (MLP) that generates RGBA at each 5D coordinate from multi-view image features. Unlike neural scene representation work that optimizes per-scene functions for rendering, we learn a generic view interpolation function that naturally generalizes to novel scene types and camera setups. Compared to previous generic image-based rendering (IBR) methods like multiplane images (MPIs) that use discrete volume representations, our method instead produces RGBA values at continuous 5D locations (3D spatial locations and 2D viewing directions), enabling high-resolution image rendering. Our rendering pipeline is fully differentiable, and the only input required to train our method is a set of multi-view posed images. Experiments show that our method outperforms previous IBR methods, and achieves state-of-the-art performance when fine-tuned on each test scene.
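The sketch below is an illustrative rendering head in the spirit of this abstract, not the IBRNet architecture itself: features gathered from nearby source views for a single 5D sample are pooled and, together with the viewing direction, mapped by a small MLP to color and density, so no individual scene is baked into the weights. Dimensions and the mean/variance pooling choice are assumptions.

```python
# Illustrative generic image-based rendering head; not the actual IBRNet model.
import torch
import torch.nn as nn

class GenericIBRHead(nn.Module):
    def __init__(self, feat_dim=32):
        super().__init__()
        # Mean and variance over source views summarize multi-view agreement.
        self.mlp = nn.Sequential(nn.Linear(2 * feat_dim + 3, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, view_feats, view_dir):
        """view_feats: (N, V, C) features of one sample from V source views;
        view_dir: (N, 3) unit viewing direction of the target ray."""
        pooled = torch.cat([view_feats.mean(dim=1), view_feats.var(dim=1)], dim=-1)
        out = self.mlp(torch.cat([pooled, view_dir], dim=-1))
        rgb = torch.sigmoid(out[..., :3])
        sigma = torch.relu(out[..., 3])
        return rgb, sigma   # composited along each ray by standard volume rendering

head = GenericIBRHead()
rgb, sigma = head(torch.randn(10, 8, 32), torch.randn(10, 3))
print(rgb.shape, sigma.shape)  # torch.Size([10, 3]) torch.Size([10])
```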
    Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image
    International Conference on Computer Vision (ICCV), 2021
    We introduce the problem of perpetual view generation: long-range generation of novel views corresponding to an arbitrarily long camera trajectory given a single image. This is a challenging problem that goes far beyond the capabilities of current view synthesis methods, which work for a limited range of viewpoints and quickly degenerate when presented with a large camera motion. Methods designed for video generation also have limited ability to produce long video sequences and are often agnostic to scene geometry. We take a hybrid approach that integrates both geometry and image synthesis in an iterative render, refine, and repeat framework, allowing for long-range generation that covers large distances after hundreds of frames. Our approach can be trained from a set of monocular video sequences without any manual annotation. We propose a dataset of aerial footage of natural coastal scenes, and compare our method with recent view synthesis and conditional video generation baselines, showing that it can generate plausible scenes for much longer time horizons over large camera trajectories compared to existing methods.
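To make the render-refine-repeat framework concrete, the schematic sketch below shows only the control flow of perpetual generation, with identity stand-ins for the learned renderer, refinement network, and pose sampler; every function here is a placeholder rather than the released implementation.

```python
# Schematic sketch of the render-refine-repeat loop; components are placeholders.
import torch

def perpetual_view_generation(image, depth, render, refine, next_pose, n_steps=100):
    """image: (1,3,H,W); depth: (1,1,H,W); render(image, depth, pose) warps the
    current view to a new camera pose; refine fills disocclusions and returns an
    updated (image, depth) pair; next_pose() yields the next camera motion."""
    frames = [image]
    for _ in range(n_steps):
        pose = next_pose()
        # Render: geometrically warp the current frame to the next camera pose.
        warped_img, warped_depth = render(image, depth, pose)
        # Refine: inpaint missing regions and sharpen blurry content.
        image, depth = refine(warped_img, warped_depth)
        # Repeat: the refined output becomes the input for the next step.
        frames.append(image)
    return frames

# Toy usage with identity stand-ins, just to show the loop runs end to end.
render = lambda img, d, pose: (img, d)
refine = lambda img, d: (img, d)
next_pose = lambda: torch.eye(4)
frames = perpetual_view_generation(torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64),
                                   render, refine, next_pose, n_steps=3)
print(len(frames))  # 4
```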