Conference on Computer Vision and Pattern Recognition (CVPR) (2023) (to appear)
Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision. Recently popularized approaches based on neural scene representations have achieved tremendous impact and have been applied across a variety of applications. One of the major remaining challenges in this space is training a single model that provides latent representations that generalize effectively beyond a single scene. Scene Representation Transformer (SRT) has shown promise in this direction, but scaling it to a larger set of diverse scenes is challenging and necessitates accurately posed ground truth data. To address this problem, we propose RUST (Really Unposed Scene representation Transformer), a pose-free approach to novel view synthesis trained on RGB images alone. Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding which is used by the decoder for view synthesis. We perform an empirical investigation into the learned latent pose structure and show that it allows meaningful test-time camera transformations and accurate explicit pose readouts. Perhaps surprisingly, RUST achieves quality similar to that of methods that have access to perfect camera pose, thereby unlocking the potential for large-scale training of amortized neural scene representations.
Data is the driving force of machine learning. The amount and quality of training data are often more important for the performance of a system than the details of its architecture. Data is also an important tool for testing specific hypotheses and for empirically evaluating the behaviour of complex systems. Synthetic data generation is a powerful tool with several advantages: 1) it is cheap, 2) it supports rich ground-truth annotations, 3) it offers full control over data, and 4) it can circumvent privacy and legal concerns. Unfortunately, the toolchain for generating data is less well developed than that for building models. We aim to improve this situation by introducing Kubric: a scalable open-source pipeline for generating realistic image and video data with rich ground truth annotations.
We also publish a collection of generated datasets and baseline results on several vision tasks.
Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
A classical problem in computer vision is to infer a 3D scene representation from a few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene.
In this work, we propose the Scene Representation Transformer (SRT), a method which processes posed or unposed RGB images of a new area, infers a "set-latent scene representation", and synthesises novel views, all in a single feed-forward pass. To calculate the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration, and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by attending into the scene representation to render novel views. Learning is supervised end-to-end by minimizing a novel-view reconstruction error.
We show that this method outperforms recent baselines in terms of PSNR and speed on synthetic datasets, including a new dataset created for the paper. Further, we demonstrate that SRT scales to support interactive visualization and semantic segmentation of real-world outdoor environments using Street View imagery.
We present NeSF, a method for producing 3D semantic fields from pre-trained density fields and sparse 2D semantic supervision.
Our method side-steps traditional scene representations by leveraging neural representations where 3D information is stored within neural fields.
In spite of being supervised by 2D signals alone, our method is able to generate 3D-consistent semantic maps from novel camera poses and can be queried at arbitrary 3D points.
Notably, NeSF is compatible with any method producing a density field, and its accuracy improves as the quality of the pre-trained density fields improves.
Our empirical analysis demonstrates comparable quality to competitive 2D and 3D semantic segmentation baselines on convincing synthetic scenes while also offering features unavailable to existing methods.
We show that generating English Wikipedia articles can be approached as a multi-document summarization of source documents. We use extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. For the abstractive model, we introduce a decoder-only architecture that can scalably attend to very long sequences, much longer than typical encoder-decoder architectures used in sequence transduction. We show that this model can generate fluent, coherent multi-sentence paragraphs and even whole Wikipedia articles. When given reference documents, we show it can extract relevant factual information as reflected in perplexity, ROUGE scores and human evaluations.
In robotic applications we often face the challenge of detecting instances of objects for which we have either no trained models or very little labeled data. In this paper we propose to use self-supervisory signals, generated without human supervision by a robot exploring an environment, to learn a representation of the novel object instances present in this environment. We demonstrate the utility of this representation in two ways. First, we can automatically discover objects by performing clustering in this space. Each resulting cluster contains examples of one instance seen from various viewpoints and scales. Second, if given a small number of labeled images, we can efficiently learn detectors for these labels. In the few-shot regime these detectors have a substantially higher mAP of XX compared to off-the-shelf standard detectors trained on this limited data. Thus, the self-supervision results in efficient and performant object discovery and detection at no or very small human labeling cost.