Generalizable Patch-Based Neural Rendering
Abstract
Neural rendering has received tremendous attention since the advent of
Neural Radiance Fields (NeRF), and has pushed the state-of-the-art on
novel-view synthesis considerably. The recent focus has been on models
that overfit to a single scene, and the few attempts to learn models
that can synthesize novel views of unseen scenes mostly consist of
combining deep convolutional features with a NeRF-like model. We
propose a different paradigm, where no deep visual features and no
NeRF-like volume rendering are needed. Our method is capable of
predicting the color of a target ray in a novel scene directly, just
from a collection of patches sampled from the scene. We first leverage
epipolar geometry to extract patches along the epipolar lines of each
reference view. Each patch is linearly projected into a 1D feature
vector, and a sequence of transformers processes the collection. For
positional encoding, we parameterize rays as in a light field
representation, with the crucial difference that the coordinates are
canonicalized with respect to the target ray, which makes our method
independent of the reference frame and improves generalization. We
show that our approach outperforms the state-of-the-art on novel-view
synthesis of unseen scenes even when trained with considerably
less data than prior work. Our code is available at
https://mohammedsuhail.net/gen_patch_neural_rendering.
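
To make the patch-embedding step concrete, the following is a minimal sketch of how a patch could be flattened and linearly projected into a 1D feature vector before being passed to the transformers. It is an illustration only, not the released implementation: the helper names (`flatten_patch`, `linear_project`), the patch size, the feature dimension, and the random weights are all assumptions chosen for the example.

```python
import random


def flatten_patch(patch):
    """Flatten an H x W x C patch (nested lists) into a 1D list of floats."""
    return [float(v) for row in patch for pixel in row for v in pixel]


def linear_project(vector, weights, bias):
    """Apply y = W v + b, where `weights` holds one row per output feature."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


# Hypothetical sizes (not taken from the paper): 3x3 RGB patches, 8-D features.
PATCH_SIZE, CHANNELS, FEATURE_DIM = 3, 3, 8
in_dim = PATCH_SIZE * PATCH_SIZE * CHANNELS

rng = random.Random(0)
# Randomly initialized projection; in a trained model these would be learned.
weights = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(FEATURE_DIM)]
bias = [0.0] * FEATURE_DIM

# Two toy patches standing in for patches sampled along an epipolar line.
patches = [
    [[[rng.random() for _ in range(CHANNELS)] for _ in range(PATCH_SIZE)]
     for _ in range(PATCH_SIZE)]
    for _ in range(2)
]

# One feature vector ("token") per patch; a transformer would consume this list.
tokens = [linear_project(flatten_patch(p), weights, bias) for p in patches]
```

Each patch here becomes one token of length `FEATURE_DIM`. The sketch omits the positional encoding described above.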