Mehdi S. M. Sajjadi
Mehdi S. M. Sajjadi is a team lead at Google DeepMind (Berlin, Germany) working on 3D-aware neural scene representations and deep generative models. Previously, he was a PhD student at the Max Planck Institute for Intelligent Systems under Prof. Dr. Bernhard Schölkopf, with an associated fellowship at the ETH Center for Learning Systems, after studying computer science & math at the University of Hamburg, where he received an MSc with distinction. His research on machine learning and computer vision has been published at several renowned conferences, including NeurIPS, ICML, ICLR, CVPR, ICCV, and ECCV.
For more information and an up-to-date list of publications, please visit msajjadi.com.
Authored Publications
DORSal: Diffusion for Object-centric Representations of Scenes
Allan Jabri
Emiel Hoogeboom
Thomas Kipf
International Conference on Learning Representations (2024)
Recent progress in 3D scene understanding enables scalable learning of representations across large datasets of diverse scenes. As a consequence, generalization to unseen scenes and objects, rendering novel views from just a single or a handful of input images, and controllable scene generation that supports editing, is now possible. However, training jointly on a large number of scenes typically compromises rendering quality when compared to single-scene optimized models such as NeRFs. In this paper, we leverage recent progress in diffusion models to equip 3D scene representation learning models with the ability to render high-fidelity novel views, while retaining benefits such as object-level scene editing to a large degree. In particular, we propose DORSal, which adapts a video diffusion architecture for 3D scene generation conditioned on frozen object-centric slot-based representations of scenes. On both complex synthetic multi-object scenes and on the real-world large-scale Street View dataset, we show that DORSal enables scalable neural rendering of 3D scenes with object-level editing and improves upon existing approaches.
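As a rough illustration of the conditioning scheme described in the abstract (a diffusion denoiser attending to frozen, object-centric slot representations), here is a minimal, hypothetical sketch. Module names, shapes, and the patch-token formulation are assumptions for illustration, not the DORSal implementation.

```python
# Hypothetical sketch: a denoiser for noisy target-view patches that is
# conditioned, via cross-attention, on frozen object slots of the scene.
import torch
import torch.nn as nn

class SlotConditionedDenoiser(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.patch_embed = nn.Linear(3 * 8 * 8, dim)      # 8x8 RGB patches -> tokens
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_patch = nn.Linear(dim, 3 * 8 * 8)

    def forward(self, noisy_patches, slots):
        # noisy_patches: [B, N, 3*8*8] noisy target-view patches
        # slots:         [B, K, dim]   frozen object slots (not updated here)
        x = self.patch_embed(noisy_patches)
        x = x + self.self_attn(x, x, x)[0]                 # mix information across patches
        x = x + self.cross_attn(x, slots, slots)[0]        # condition on the frozen slots
        return self.to_patch(x)                            # predicted noise per patch

denoiser = SlotConditionedDenoiser()
noise_pred = denoiser(torch.randn(2, 64, 192), torch.randn(2, 8, 256))
print(noise_pred.shape)  # torch.Size([2, 64, 192])
```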
RePAST: Relative Pose Attention Scene Representation Transformer
The Scene Representation Transformer (SRT) is a recent method to render novel views at interactive rates. Since SRT uses camera poses with respect to an arbitrarily chosen reference camera, it is not invariant to the order of the input views. As a result, SRT is not directly applicable to large-scale scenes where the reference frame would need to be changed regularly. In this work, we propose Relative Pose Attention SRT (RePAST): Instead of fixing a reference frame at the input, we inject pairwise relative camera pose information directly into the attention mechanism of the Transformers. This leads to a model that is by definition invariant to the choice of any global reference frame, while still retaining the full capabilities of the original method. Empirical results show that adding this invariance to the model does not lead to a loss in quality. We believe that this is a step towards applying fully latent transformer-based rendering methods to large-scale scenes.
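A minimal sketch of the relative-pose idea described above: rather than expressing camera poses in one global reference frame, each query/key pair in attention receives a bias computed from the relative pose between their cameras, so a change of global frame cancels out. The scalar-bias form and all names are illustrative assumptions, not the released method.

```python
# Toy relative-pose attention over tokens that each belong to some camera.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_pose_attention(tokens, cam_of_token, poses, w_rel):
    # tokens:       [N, D] one feature per token
    # cam_of_token: [N]    camera index of each token
    # poses:        [C, 4, 4] camera-to-world matrices
    # w_rel:        [16]   maps a flattened relative pose to a scalar attention bias
    N, D = tokens.shape
    logits = tokens @ tokens.T / np.sqrt(D)               # content-based attention
    for i in range(N):
        for j in range(N):
            rel = np.linalg.inv(poses[cam_of_token[i]]) @ poses[cam_of_token[j]]
            logits[i, j] += rel.reshape(-1) @ w_rel        # pose-dependent bias
    return softmax(logits) @ tokens                        # unchanged if all poses share a global transform

tokens = np.random.randn(6, 32)
cams = np.array([0, 0, 1, 1, 2, 2])
poses = np.stack([np.eye(4)] * 3)
out = relative_pose_attention(tokens, cams, poses, np.random.randn(16))
print(out.shape)  # (6, 32)
```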
RUST: Latent Neural Scene Representations from Unposed Imagery
Thomas Kipf
Klaus Greff
Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision. Recently popularized approaches based on neural scene representations have achieved tremendous impact and have been applied across a variety of applications. One of the major remaining challenges in this space is training a single model which can provide latent representations which effectively generalize beyond a single scene. Scene Representation Transformer (SRT) has shown promise in this direction, but scaling it to a larger set of diverse scenes is challenging and necessitates accurately posed ground truth data. To address this problem, we propose RUST (Really Unposed Scene representation Transformer), a pose-free approach to novel view synthesis trained on RGB images alone. Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding which is used by the decoder for view synthesis. We perform an empirical investigation into the learned latent pose structure and show that it allows meaningful test-time camera transformations and accurate explicit pose readouts. Perhaps surprisingly, RUST achieves similar quality as methods which have access to perfect camera pose, thereby unlocking the potential for large-scale training of amortized neural scene representations.
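A toy sketch of the pose-free training idea described above, assuming very small stand-in networks: a Pose Encoder peeks at part of the target view and produces a latent pose, which the decoder combines with the scene representation to reconstruct the target from RGB alone. Module names and sizes are illustrative, not the RUST architecture.

```python
# Pose-free novel view synthesis, toy version: no camera poses anywhere.
import torch
import torch.nn as nn

scene_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))   # input view -> scene code
pose_encoder  = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 16, 8))     # half the target -> latent pose
decoder       = nn.Sequential(nn.Linear(256 + 8, 512), nn.ReLU(), nn.Linear(512, 3 * 32 * 32))

params = [*scene_encoder.parameters(), *pose_encoder.parameters(), *decoder.parameters()]
opt = torch.optim.Adam(params, lr=1e-4)

input_view  = torch.rand(4, 3, 32, 32)   # unposed RGB input
target_view = torch.rand(4, 3, 32, 32)   # unposed RGB target

scene = scene_encoder(input_view)
latent_pose = pose_encoder(target_view[:, :, :, :16])      # "peek" at half the target view
pred = decoder(torch.cat([scene, latent_pose], dim=-1))
loss = ((pred - target_view.flatten(1)) ** 2).mean()       # plain RGB reconstruction loss
loss.backward()
opt.step()
print(float(loss))
```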
Test-time Adaptation with Slot-centric Models
Mihir Prabhudesai
Anirudh Goyal
Gaurav Aggarwal
Thomas Kipf
Deepak Pathak
Katerina Fragkiadaki
International Conference on Machine Learning (2023), pp. 28151-28166
Current visual detectors, though impressive within their training distribution, often fail to parse out-of-distribution scenes into their constituent entities. Recent test-time adaptation methods use auxiliary self-supervised losses to adapt the network parameters to each test example independently and have shown promising results towards generalization outside the training distribution for the task of image classification. In our work, we find evidence that these losses are insufficient for the task of scene decomposition, without also considering architectural inductive biases. Recent slot-centric generative models attempt to decompose scenes into entities in a self-supervised manner by reconstructing pixels. Drawing upon these two lines of work, we propose Slot-TTA, a semi-supervised slot-centric scene decomposition model that at test time is adapted per scene through gradient descent on reconstruction or cross-view synthesis objectives. We evaluate Slot-TTA across multiple input modalities, images or 3D point clouds, and show substantial out-of-distribution performance improvements against state-of-the-art supervised feed-forward detectors, and alternative test-time adaptation methods. Project Webpage: http://slot-tta.github.io/
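The test-time adaptation loop can be pictured with a toy autoencoder, as in the sketch below: a trained model is copied and then fine-tuned on a single test scene by gradient descent on a self-supervised reconstruction loss. The architecture here is a placeholder, not the Slot-TTA model.

```python
# Per-scene test-time adaptation with a toy slot-style autoencoder.
import torch
import torch.nn as nn

class SlotAutoencoder(nn.Module):
    def __init__(self, num_slots=4, dim=64):
        super().__init__()
        self.encode = nn.Linear(3 * 32 * 32, num_slots * dim)
        self.decode = nn.Linear(num_slots * dim, 3 * 32 * 32)

    def forward(self, img):
        slots = self.encode(img.flatten(1))        # crude "slots", for illustration only
        return self.decode(slots), slots

def adapt_to_scene(model, test_img, steps=10, lr=1e-4):
    # Copy the trained weights so every test scene starts from the same model.
    local = SlotAutoencoder()
    local.load_state_dict(model.state_dict())
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    for _ in range(steps):
        recon, _ = local(test_img)
        loss = ((recon - test_img.flatten(1)) ** 2).mean()   # self-supervised objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return local                                             # adapted, per-scene model

model = SlotAutoencoder()
adapted = adapt_to_scene(model, torch.rand(1, 3, 32, 32))
```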
Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames
Ondrej Biza
Gamaleldin Elsayed
Thomas Kipf
International Conference on Machine Learning (2023), pp. 2507-2527
Automatically discovering composable abstractions from raw perceptual data is a long-standing challenge in machine learning. Recent slot-based neural networks that learn about objects in a self-supervised manner have made exciting progress in this direction. However, they typically fall short at adequately capturing spatial symmetries present in the visual world, which leads to sample inefficiency, for example by entangling object appearance and pose. In this paper, we present a simple yet highly effective method for incorporating spatial symmetries via slot-centric reference frames. We incorporate equivariance to per-object pose transformations into the attention and generation mechanism of Slot Attention by translating, scaling, and rotating position encodings. These changes result in little computational overhead, are easy to implement, and can result in large gains in terms of data efficiency and overall improvements to object discovery. We evaluate our method on a wide range of synthetic object discovery benchmarks, namely Tetrominoes, CLEVRTex, Objects Room, and MultiShapeNet, and show promising improvements on the challenging real-world Waymo Open dataset.
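A small numerical sketch of the slot-centric reference frame idea described above: each slot derives a position and scale from its attention map, and the positional grid is re-expressed relative to that frame before it enters the slot update. This is a simplified illustration under assumed shapes, not the paper's Slot Attention module.

```python
# Per-slot reference frames from attention maps over a pixel grid.
import numpy as np

H = W = 8
grid = np.stack(np.meshgrid(np.linspace(-1, 1, W), np.linspace(-1, 1, H)), -1).reshape(-1, 2)  # [HW, 2]
attn = np.random.rand(3, H * W)                         # attention of 3 slots over pixels
attn /= attn.sum(axis=1, keepdims=True)

for k in range(attn.shape[0]):
    pos = attn[k] @ grid                                     # slot position: attention-weighted mean
    scale = np.sqrt(attn[k] @ ((grid - pos) ** 2)) + 1e-8    # per-axis spread as slot scale
    rel_grid = (grid - pos) / scale                          # grid in the slot's own reference frame
    # rel_grid would replace the global grid in this slot's position encoding,
    # making its update approximately equivariant to object translation and scale.
    print(k, pos.round(2), scale.round(2))
```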
Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations
Henning Meyer
Urs Bergmann
Klaus Greff
Noha Radwan
Alexey Dosovitskiy
Jakob Uszkoreit
Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
A classical problem in computer vision is to infer a 3D scene representation from few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene.
In this work, we propose the Scene Representation Transformer (SRT), a method which processes posed or unposed RGB images of a new area, infers a "set-latent scene representation", and synthesises novel views, all in a single feed-forward pass. To calculate the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration, and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by attending into the scene representation to render novel views. Learning is supervised end-to-end by minimizing a novel-view reconstruction error.
We show that this method outperforms recent baselines in terms of PSNR and speed on synthetic datasets, including a new dataset created for the paper. Further, we demonstrate that SRT scales to support interactive visualization and semantic segmentation of real-world outdoor environments using Street View imagery.
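The overall pipeline can be sketched roughly as follows: patch tokens from all input views are pooled by an encoder transformer into a set-latent scene representation, and a light-field decoder renders a pixel by cross-attending into that set with a query derived from the target ray. All modules and dimensions below are stand-ins, not the SRT implementation.

```python
# Set-latent scene representation + ray-conditioned cross-attention decoding, toy version.
import torch
import torch.nn as nn

dim = 128
encoder_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
patchify = nn.Linear(3 * 8 * 8, dim)              # 8x8 RGB patches -> tokens
ray_embed = nn.Linear(6, dim)                     # ray origin + direction -> query
cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
to_rgb = nn.Linear(dim, 3)

views = torch.rand(1, 5, 3, 64, 64)                               # 5 input views of one scene
patches = views.unfold(3, 8, 8).unfold(4, 8, 8)                    # split H and W into 8x8 patches
patches = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(1, -1, 3 * 8 * 8)
scene = encoder(patchify(patches))                                 # set-latent scene representation

rays = torch.rand(1, 16, 6)                                        # 16 target rays (origin, direction)
query = ray_embed(rays)
feat, _ = cross_attn(query, scene, scene)                          # decoder attends into the scene
print(to_rgb(feat).shape)                                          # [1, 16, 3] rendered colors
```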
RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs
Michael Niemeyer
Ben Mildenhall
Andreas Geiger
Noha Radwan
Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
Neural Radiance Fields (NeRF) have emerged as a powerful representation for the task of novel view synthesis due to their simplicity and state-of-the-art performance. Though NeRF can produce photorealistic renderings of unseen viewpoints when many input views are available, its performance drops significantly when this number is reduced. We observe that the majority of artifacts in sparse input scenarios are caused by errors in the estimated scene geometry, and by divergent behavior at the start of training. We address this by regularizing the geometry and appearance of patches rendered from unobserved viewpoints, and annealing the ray sampling space during training. We additionally use a normalizing flow model to regularize the color of unobserved viewpoints. Our model outperforms not only other methods that optimize over a single scene, but in many cases also conditional models that are extensively pre-trained on large multi-view datasets.
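One ingredient mentioned above, regularizing the geometry of patches rendered from unobserved viewpoints, can be illustrated with a simple depth-smoothness penalty. The depth patch here is synthetic and the exact loss used in the paper may differ; this is only a shape-level sketch.

```python
# Depth-smoothness regularizer on a rendered patch from an unseen viewpoint.
import torch

def depth_smoothness_loss(depth_patch):
    # depth_patch: [B, P, P] expected ray-termination depth for a rendered patch
    dx = depth_patch[:, :, 1:] - depth_patch[:, :, :-1]
    dy = depth_patch[:, 1:, :] - depth_patch[:, :-1, :]
    return (dx ** 2).mean() + (dy ** 2).mean()

# Example: penalize a noisy 8x8 depth patch sampled from a random unobserved camera.
fake_depth = torch.rand(4, 8, 8, requires_grad=True)
loss = depth_smoothness_loss(fake_depth)
loss.backward()
print(float(loss))
```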
Kubric: A scalable dataset generator
Anissa Yuenming Mak
Austin Stone
Carl Doersch
Cengiz Oztireli
Charles Herrmann
Daniel Rebain
Derek Nowrouzezahrai
Dmitry Lagun
Fangcheng Zhong
Florian Golemo
Francois Belletti
Henning Meyer
Hsueh-Ti (Derek) Liu
Issam Laradji
Klaus Greff
Kwang Moo Yi
Matan Sela
Noha Radwan
Thomas Kipf
Tianhao Wu
Vincent Sitzmann
Yilun Du
Yishu Miao
Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
Data is the driving force of machine learning. The amount and quality of training data are often more important for the performance of a system than the details of its architecture. Data is also an important tool for testing specific hypotheses, and for empirically evaluating the behaviour of complex systems. Synthetic data generation is a powerful tool that can address all of these needs: 1) it is cheap, 2) it supports rich ground-truth annotations, 3) it offers full control over data, and 4) it can circumvent privacy and legal concerns. Unfortunately, the toolchain for generating data is less well developed than that for building models. We aim to improve this situation by introducing Kubric: a scalable open-source pipeline for generating realistic image and video data with rich ground-truth annotations.
We also publish a collection of generated datasets and baseline results on several vision tasks.
Object Scene Representation Transformer
Filip Pavetić
Leonidas Guibas
Klaus Greff
Thomas Kipf
Advances in Neural Information Processing Systems (2022), pp. 9512-9524
A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition. Facilitating the learning of such a representation in neural networks holds promise for substantially improving labeled data efficiency. As a key step in this direction, we make progress on the problem of learning 3D-consistent decompositions of complex scenes into individual objects in an unsupervised fashion. We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis. OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods. At the same time, it is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder.
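A loose sketch of a Slot-Mixer-style decoding step as named above: each target ray computes attention weights over the object slots, mixes the slots into a single feature, and decodes it to a color, with the attention weights doubling as a soft per-ray object decomposition. Dimensions and module names are assumptions, not the OSRT code.

```python
# Toy slot-mixing decoder: mix object slots per ray, then decode one color.
import torch
import torch.nn as nn

dim, num_slots = 64, 5
ray_to_query = nn.Linear(6, dim)
decode_rgb = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 3))

slots = torch.randn(1, num_slots, dim)            # object-centric scene representation
rays = torch.rand(1, 100, 6)                      # 100 target rays (origin, direction)

q = ray_to_query(rays)                                                   # [1, 100, dim]
attn = torch.softmax(q @ slots.transpose(1, 2) / dim ** 0.5, dim=-1)     # [1, 100, num_slots]
mixed = attn @ slots                                                     # per-ray mixture of slots
rgb = decode_rgb(mixed)                                                  # [1, 100, 3]
print(rgb.shape, attn.shape)   # the attention map acts as a soft object decomposition per ray
```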
NeSF: Neural Semantic Fields for Generalizable Semantic Segmentation of 3D Scenes
Noha Radwan*
Klaus Greff
Henning Meyer
Kyle Genova
Transactions on Machine Learning Research (2022)
We present NeSF, a method for producing 3D semantic fields from pre-trained density fields and sparse 2D semantic supervision. Our method side-steps traditional scene representations by leveraging neural representations where 3D information is stored within neural fields. In spite of being supervised by 2D signals alone, our method is able to generate 3D-consistent semantic maps from novel camera poses and can be queried at arbitrary 3D points. Notably, NeSF is compatible with any method producing a density field, and its accuracy improves as the quality of the pre-trained density fields improves. Our empirical analysis demonstrates comparable quality to competitive 2D and 3D semantic segmentation baselines on convincing synthetic scenes while also offering features unavailable to existing methods.
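The setup can be sketched as follows, with toy stand-ins for every component: a frozen density field supplies volume-rendering weights along each ray, and a separate semantic MLP is trained so that the rendered 2D semantic map matches sparse 2D labels. Shapes, step sizes, and the stand-in density function are assumptions for illustration only.

```python
# Training a semantic field from 2D labels using weights from a frozen density field.
import torch
import torch.nn as nn

def density_field(x):                      # frozen, pre-trained density (toy stand-in)
    return torch.exp(-(x ** 2).sum(-1))

semantic_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 4))  # 4 classes
opt = torch.optim.Adam(semantic_mlp.parameters(), lr=1e-3)

points = torch.rand(16, 32, 3)             # 16 rays, 32 samples per ray
labels = torch.randint(0, 4, (16,))        # sparse 2D semantic label per ray/pixel

sigma = density_field(points)                                      # [16, 32]
alpha = 1 - torch.exp(-sigma * 0.1)                                # per-sample opacity (step 0.1)
trans = torch.cumprod(torch.cat([torch.ones(16, 1), 1 - alpha + 1e-10], -1), -1)[:, :-1]
weights = alpha * trans                                            # standard volume-rendering weights
logits_2d = (weights.unsqueeze(-1) * semantic_mlp(points)).sum(1)  # render semantics to 2D
loss = nn.functional.cross_entropy(logits_2d, labels)
loss.backward()
opt.step()
print(float(loss))
```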