Mehdi S. M. Sajjadi

Mehdi S. M. Sajjadi is a machine learning researcher at Google Brain working on deep generative models. His further research interests include computational imaging and the accurate evaluation of generative models. Previously, he was a PhD student at the Max Planck Institute for Intelligent Systems with Prof. Dr. Bernhard Schölkopf, with an associated fellowship at the ETH Center for Learning Systems. Before that, he studied computer science and math at the University of Hamburg, receiving an MSc with distinction. His research on machine learning and computer vision has been published at several renowned conferences including NeurIPS, ICML, ICLR, CVPR, ICCV, and ECCV. For more information, see msajjadi.com.
Authored Publications
    DORSal: Diffusion for Object-centric Representations of Scenes et al.
    Allan Jabri
    Emiel Hoogeboom
    Thomas Kipf
    International Conference on Learning Representations (2024)
    Abstract: Recent progress in 3D scene understanding enables scalable learning of representations across large datasets of diverse scenes. As a consequence, generalization to unseen scenes and objects, rendering novel views from just a single or a handful of input images, and controllable scene generation that supports editing are now possible. However, training jointly on a large number of scenes typically compromises rendering quality when compared to single-scene optimized models such as NeRFs. In this paper, we leverage recent progress in diffusion models to equip 3D scene representation learning models with the ability to render high-fidelity novel views, while retaining benefits such as object-level scene editing to a large degree. In particular, we propose DORSal, which adapts a video diffusion architecture for 3D scene generation conditioned on frozen object-centric slot-based representations of scenes. On both complex synthetic multi-object scenes and on the real-world large-scale Street View dataset, we show that DORSal enables scalable neural rendering of 3D scenes with object-level editing and improves upon existing approaches.
    Abstract: Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision. Recently popularized approaches based on neural scene representations have achieved tremendous impact and have been applied across a variety of applications. One of the major remaining challenges in this space is training a single model that can provide latent representations which effectively generalize beyond a single scene. Scene Representation Transformer (SRT) has shown promise in this direction, but scaling it to a larger set of diverse scenes is challenging and necessitates accurately posed ground truth data. To address this problem, we propose RUST (Really Unposed Scene representation Transformer), a pose-free approach to novel view synthesis trained on RGB images alone. Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding which is used by the decoder for view synthesis. We perform an empirical investigation into the learned latent pose structure and show that it allows meaningful test-time camera transformations and accurate explicit pose readouts. Perhaps surprisingly, RUST achieves quality similar to that of methods which have access to perfect camera pose, thereby unlocking the potential for large-scale training of amortized neural scene representations.
    Test-time Adaptation with Slot-centric Models
    Mihir Prabhudesai
    Anirudh Goyal
    Gaurav Aggarwal
    Thomas Kipf
    Deepak Pathak
    Katerina Fragkiadaki
    International Conference on Machine Learning (2023), pp. 28151-28166
    Abstract: Current visual detectors, though impressive within their training distribution, often fail to parse out-of-distribution scenes into their constituent entities. Recent test-time adaptation methods use auxiliary self-supervised losses to adapt the network parameters to each test example independently and have shown promising results towards generalization outside the training distribution for the task of image classification. In our work, we find evidence that these losses are insufficient for the task of scene decomposition, without also considering architectural inductive biases. Recent slot-centric generative models attempt to decompose scenes into entities in a self-supervised manner by reconstructing pixels. Drawing upon these two lines of work, we propose Slot-TTA, a semi-supervised slot-centric scene decomposition model that at test time is adapted per scene through gradient descent on reconstruction or cross-view synthesis objectives. We evaluate Slot-TTA across multiple input modalities, images or 3D point clouds, and show substantial out-of-distribution performance improvements against state-of-the-art supervised feed-forward detectors, and alternative test-time adaptation methods. Project Webpage: http://slot-tta.github.io/
    Abstract: The Scene Representation Transformer (SRT) is a recent method to render novel views at interactive rates. Since SRT uses camera poses with respect to an arbitrarily chosen reference camera, it is not invariant to the order of the input views. As a result, SRT is not directly applicable to large-scale scenes where the reference frame would need to be changed regularly. In this work, we propose Relative Pose Attention SRT (RePAST): Instead of fixing a reference frame at the input, we inject pairwise relative camera pose information directly into the attention mechanism of the Transformers. This leads to a model that is by definition invariant to the choice of any global reference frame, while still retaining the full capabilities of the original method. Empirical results show that adding this invariance to the model does not lead to a loss in quality. We believe that this is a step towards applying fully latent transformer-based rendering methods to large-scale scenes.
    Abstract: Automatically discovering composable abstractions from raw perceptual data is a long-standing challenge in machine learning. Recent slot-based neural networks that learn about objects in a self-supervised manner have made exciting progress in this direction. However, they typically fall short at adequately capturing spatial symmetries present in the visual world, which leads to sample inefficiency, such as when entangling object appearance and pose. In this paper, we present a simple yet highly effective method for incorporating spatial symmetries via slot-centric reference frames. We incorporate equivariance to per-object pose transformations into the attention and generation mechanism of Slot Attention by translating, scaling, and rotating position encodings. These changes result in little computational overhead, are easy to implement, and can result in large gains in terms of data efficiency and overall improvements to object discovery. We evaluate our method on a wide range of synthetic object discovery benchmarks, namely Tetrominoes, CLEVRTex, Objects Room, and MultiShapeNet, and show promising improvements on the challenging real-world Waymo Open dataset.
    Object Scene Representation Transformer
    Filip Pavetić
    Leonidas Guibas
    Klaus Greff
    Thomas Kipf
    Advances in Neural Information Processing Systems (2022), pp. 9512-9524
    Abstract: A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition. Facilitating the learning of such a representation in neural networks holds promise for substantially improving labeled data efficiency. As a key step in this direction, we make progress on the problem of learning 3D-consistent decompositions of complex scenes into individual objects in an unsupervised fashion. We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis. OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods. At the same time, it is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder.
    Kubric: A scalable dataset generator
    Anissa Yuenming Mak
    Austin Stone
    Carl Doersch
    Cengiz Oztireli
    Charles Herrmann
    Daniel Rebain
    Derek Nowrouzezahrai
    Dmitry Lagun
    Fangcheng Zhong
    Florian Golemo
    Francois Belletti
    Henning Meyer
    Hsueh-Ti (Derek) Liu
    Issam Laradji
    Klaus Greff
    Kwang Moo Yi
    Matan Sela
    Noha Radwan
    Thomas Kipf
    Tianhao Wu
    Vincent Sitzmann
    Yilun Du
    Yishu Miao
    Abstract: Data is the driving force of machine learning. The amount and quality of training data are often more important for the performance of a system than the details of its architecture. Data is also an important tool for testing specific hypotheses and for empirically evaluating the behaviour of complex systems. Synthetic data generation represents a powerful tool that can address all these shortcomings: 1) it is cheap, 2) it supports rich ground-truth annotations, 3) it offers full control over data, and 4) it can circumvent privacy and legal concerns. Unfortunately, the toolchain for generating data is less well developed than that for building models. We aim to improve this situation by introducing Kubric: a scalable open-source pipeline for generating realistic image and video data with rich ground-truth annotations. We also publish a collection of generated datasets and baseline results on several vision tasks.
    Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations
    Henning Meyer
    Urs Bergmann
    Klaus Greff
    Noha Radwan
    Alexey Dosovitskiy
    Jakob Uszkoreit
    Tom Funkhouser
    Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
    Abstract: A classical problem in computer vision is to infer a 3D scene representation from few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene. In this work, we propose the Scene Representation Transformer (SRT), a method which processes posed or unposed RGB images of a new area, infers a "set-latent scene representation", and synthesises novel views, all in a single feed-forward pass. To calculate the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration, and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by attending into the scene representation to render novel views. Learning is supervised end-to-end by minimizing a novel-view reconstruction error. We show that this method outperforms recent baselines in terms of PSNR and speed on synthetic datasets, including a new dataset created for the paper. Further, we demonstrate that SRT scales to support interactive visualization and semantic segmentation of real-world outdoor environments using Street View imagery.
    RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs
    Michael Niemeyer
    Ben Mildenhall
    Andreas Geiger
    Noha Radwan
    Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
    Abstract: Neural Radiance Fields (NeRF) have emerged as a powerful representation for the task of novel view synthesis due to their simplicity and state-of-the-art performance. Though NeRF can produce photorealistic renderings of unseen viewpoints when many input views are available, its performance drops significantly when this number is reduced. We observe that the majority of artifacts in sparse input scenarios are caused by errors in the estimated scene geometry, and by divergent behavior at the start of training. We address this by regularizing the geometry and appearance of patches rendered from unobserved viewpoints, and annealing the ray sampling space during training. We additionally use a normalizing flow model to regularize the color of unobserved viewpoints. Our model outperforms not only other methods that optimize over a single scene, but in many cases also conditional models that are extensively pre-trained on large multi-view datasets.
    Abstract: We present NeSF, a method for producing 3D semantic fields from pre-trained density fields and sparse 2D semantic supervision. Our method side-steps traditional scene representations by leveraging neural representations where 3D information is stored within neural fields. In spite of being supervised by 2D signals alone, our method is able to generate 3D-consistent semantic maps from novel camera poses and can be queried at arbitrary 3D points. Notably, NeSF is compatible with any method producing a density field, and its accuracy improves as the quality of the pre-trained density fields improves. Our empirical analysis demonstrates comparable quality to competitive 2D and 3D semantic segmentation baselines on convincing synthetic scenes while also offering features unavailable to existing methods.
    NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections
    Ricardo Martin-Brualla*
    Noha Radwan*
    Alexey Dosovitskiy
    Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
    Abstract: We present a learning-based method for synthesizing novel views of complex scenes using only unstructured collections of in-the-wild photographs. We build on Neural Radiance Fields (NeRF), which uses the weights of a multilayer perceptron to model the density and color of a scene as a function of 3D coordinates. While NeRF works well on images of static subjects captured under controlled settings, it is incapable of modeling many ubiquitous, real-world phenomena in uncontrolled images, such as variable illumination or transient occluders. We introduce a series of extensions to NeRF to address these issues, thereby enabling accurate reconstructions from unstructured image collections taken from the internet. We apply our system, dubbed NeRF-W, to internet photo collections of famous landmarks, and demonstrate temporally consistent novel view renderings that are significantly closer to photorealism than the prior state of the art.
    Abstract: Recent advances in generative modeling have led to an increased interest in the study of statistical divergences as a means of model comparison. Commonly used evaluation methods, such as Fréchet Inception Distance (FID), correlate well with the perceived quality of samples and are sensitive to mode dropping. However, these metrics are unable to distinguish between different failure cases since they yield one-dimensional scores. We propose a novel definition of precision and recall for distributions which disentangles the divergence into two separate dimensions. The proposed notion is intuitive, retains desirable properties, and naturally leads to an efficient algorithm that can be used to evaluate generative models. We relate this notion to total variation as well as to recent evaluation metrics such as Inception Score and FID. To demonstrate the practical utility of the proposed approach, we perform an empirical study on several variants of Generative Adversarial Networks and the Variational Autoencoder. In an extensive set of experiments, we show that the proposed metric is able to disentangle the quality of generated samples from the coverage of the target distribution.
    Abstract: Recent advances in video super-resolution have shown that convolutional neural networks combined with motion compensation are able to merge information from multiple low-resolution (LR) frames to create high-quality results. Current state-of-the-art methods process a batch of LR frames to generate a single high-resolution (HR) frame and run this scheme in a sliding window fashion over the entire video, effectively treating the problem as many independent multi-frame super-resolution tasks. This approach has two main weaknesses: 1) Each input frame is processed and warped multiple times, leading to redundant computations, and 2) each output frame is estimated independently, limiting the system's ability to produce temporally consistent results. In this work, we propose an end-to-end trainable frame-recursive video super-resolution framework that uses the previously inferred HR estimate to super-resolve the subsequent frame. This naturally encourages temporally consistent results and avoids redundant computations by warping only one image in each step. Furthermore, due to its recurrent nature, the proposed method has the ability to assimilate a large number of previous frames without increased computational demands. Extensive evaluations and comparisons with previous methods validate the strengths of our approach and demonstrate that the proposed framework is able to significantly outperform the current state of the art.
    From Variational to Deterministic Autoencoders
    Partha Ghosh*
    Antonio Vergari
    Michael Black
    Bernhard Schölkopf
    International Conference on Learning Representations (ICLR) (2020)
    Perceptual Video Super Resolution with Enhanced Temporal Consistency
    Eduardo Pérez-Pellitero
    Michael Hirsch
    Bernhard Schölkopf
    European Conference on Computer Vision (ECCV) Workshop Perceptual Image Restoration and Manipulation (PIRM) (2018)
    Tempered Adversarial Networks
    Giambattista Parascandolo
    Arash Mehrjou
    Bernhard Schölkopf
    International Conference on Machine Learning (ICML) (2018)
    Spatio-Temporal Transformer Network for Video Restoration
    Tae Hyun Kim
    Michael Hirsch
    Bernhard Schölkopf
    European Conference on Computer Vision (ECCV) (2018)
    EnhanceNet: Single Image Super-Resolution Through Automated Texture Synthesis
    Bernhard Schölkopf
    Michael Hirsch
    International Conference on Computer Vision (ICCV) (2017)
    Depth Estimation Through a Generative Model of Light Field Synthesis
    Rolf Köhler
    Bernhard Schölkopf
    Michael Hirsch
    German Conference on Pattern Recognition (GCPR) (2016)
    Peer grading in a course on algorithms and data structures
    Morteza Alamgir
    Ulrike von Luxburg
    Third Annual ACM Conference on Learning at Scale (L@S) (2015)