Sjoerd van Steenkiste

Sjoerd van Steenkiste

Sjoerd is a research scientist at Google Research. His current research focus is twofold: (1) Analyzing LLMs through the lens of human cognition and improving their reasoning capabilities; and (2) Approaches to learning representation of 4D scenes that capture meaningful structure (objects, geometry, etc.). More broadly, he is interested in multimodality (eg. combining vision + language), compositional generalization, learning structured 'symbol-like' representations with neural networks, and the binding problem. Before joining Google he was a Postdoc at the Dalle Molle Institute for Artificial Intelligence (IDSIA) in the Italian-speaking part of Switzerland with Jürgen Schmidhuber, which is also where he completed his PhD in 2020. Prior to joining Google as a Research Scientist he was an intern with the Google Brain team in Zurich in 2018.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    DORSal: Diffusion for Object-centric Representations of Scenes et al.
    Allan Jabri
    Emiel Hoogeboom
    Thomas Kipf
    International Conference on Learning Representations (2024)
    Preview abstract Recent progress in 3D scene understanding enables scalable learning of representations across large datasets of diverse scenes. As a consequence, generalization to unseen scenes and objects, rendering novel views from just a single or a handful of input images, and controllable scene generation that supports editing, is now possible. However, training jointly on a large number of scenes typically compromises rendering quality when compared to single-scene optimized models such as NeRFs. In this paper, we leverage recent progress in diffusion models to equip 3D scene representation learning models with the ability to render high-fidelity novel views, while retaining benefits such as object-level scene editing to a large degree. In particular, we propose DORSal, which adapts a video diffusion architecture for 3D scene generation conditioned on frozen object-centric slot-based representations of scenes. On both complex synthetic multi-object scenes and on the real-world large-scale Street View dataset, we show that DORSal enables scalable neural rendering of 3D scenes with object-level editing and improves upon existing approaches. View details
    DreamSync: Aligning Text-to-Image Generation with Image Understanding Models
    Jiao Sun
    Yushi Hu
    Deqing Fu
    Royi Rassin
    Su Wang
    Charles Herrmann
    Ranjay Krishna
    Synthetic Data for Computer Vision Workshop @ CVPR 2024
    Preview abstract Text-to-Image (T2I) models still struggle to produce images that are both beautiful and faithful to the user's input text prompt. Recent frameworks to evaluate the faithfulness of T2I models, such as TIFA, have observed that large vision-language models (VLMs) can reliably analyze the generated images and measure the alignment to the text prompts. Building on this insight, we introduce DreamSync, a model-agnostic training algorithm that utilizes VLM feedback to improve T2I models. The main idea behind DreamSync is to bootstrap T2I models with their own generations. First, we use the T2I model to generate several candidate images. Then, we use two VLMs as data selectors: one is a Visual Question Answering (VQA) model that measures the alignment of generated images to user prompts, and the other measures the image aesthetic quality. After selecting the top candidate images, we use LoRA to iteratively fine-tune the T2I model. Despite its simplicity, DreamSync improves both the semantic alignment and aesthetic appeal of two diffusion-based T2I models, evidenced by multiple benchmarks (+1.77% on TIFA, +2.8% on DSG1K, +3.76% on VILA aesthetic) and human evaluations. DreamSync does not need any additional human annotation, model architecture changes, or reinforcement learning. View details
    Test-time Adaptation with Slot-centric Models
    Mihir Prabhudesai
    Anirudh Goyal
    Gaurav Aggarwal
    Thomas Kipf
    Deepak Pathak
    Katerina Fragkiadaki
    International Conference on Machine Learning (2023), pp. 28151-28166
    Preview abstract Current visual detectors, though impressive within their training distribution, often fail to parse out-of-distribution scenes into their constituent entities. Recent test-time adaptation methods use auxiliary self-supervised losses to adapt the network parameters to each test example independently and have shown promising results towards generalization outside the training distribution for the task of image classification. In our work, we find evidence that these losses are insufficient for the task of scene decomposition, without also considering architectural inductive biases. Recent slot-centric generative models attempt to decompose scenes into entities in a self-supervised manner by reconstructing pixels. Drawing upon these two lines of work, we propose Slot-TTA, a semi-supervised slot-centric scene decomposition model that at test time is adapted per scene through gradient descent on reconstruction or cross-view synthesis objectives. We evaluate Slot-TTA across multiple input modalities, images or 3D point clouds, and show substantial out-of-distribution performance improvements against state-of-the-art supervised feed-forward detectors, and alternative test-time adaptation methods. Project Webpage: http://slot-tta.github.io/ View details
    Unsupervised Learning of Temporal Abstractions with Slot-based Transformers
    Anand Gopalakrishnan
    Jürgen Schmidhuber
    Kazuki Irie
    Neural Computation, 35 (2023), pp. 593-626
    Preview abstract The discovery of reusable sub-routines simplifies decision-making and planning in complex reinforcement learning problems. Previous approaches propose to learn such temporal abstractions in an unsupervised fashion through observing state-action trajectories gathered from executing a policy. However, a current limitation is that they process each trajectory in an entirely sequential manner, which prevents them from revising earlier decisions about sub-routine boundary points in light of new incoming information. In this work we propose SloTTAr, a fully parallel approach that integrates sequence processing Transformers with a Slot Attention module to discover sub-routines in an unsupervised fashion, while leveraging adaptive computation for learning about the number of such sub-routines solely based on their empirical distribution. We demonstrate how SloTTAr is capable of outperforming strong baselines in terms of boundary point discovery, even for sequences containing variable amounts of sub-routines, while being up to 7x faster to train on existing benchmarks. View details
    Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames
    Ondrej Biza
    Gamaleldin Elsayed
    Thomas Kipf
    International Conference on Machine Learning (2023), pp. 2507-2527
    Preview abstract Automatically discovering composable abstractions from raw perceptual data is a long-standing challenge in machine learning. Recent slot-based neural networks that learn about objects in a self-supervised manner have made exciting progress in this direction. However, they typically fall short at adequately capturing spatial symmetries present in the visual world, which leads to sample inefficiency, such as when entangling object appearance and pose. In this paper, we present a simple yet highly effective method for incorporating spatial symmetries via slot-centric reference frames. We incorporate equivariance to per-object pose transformations into the attention and generation mechanism of Slot Attention by translating, scaling, and rotating position encodings. These changes result in little computational overhead, are easy to implement, and can result in large gains in terms of data efficiency and overall improvements to object discovery. We evaluate our method on a wide range of synthetic object discovery benchmarks namely Tetrominoes, CLEVRTex, Objects Room and MultiShapeNet, and show promising improvements on the challenging real-world Waymo Open dataset. View details
    Scaling Vision Transformers to 22 Billion Parameters
    Josip Djolonga
    Basil Mustafa
    Piotr Padlewski
    Justin Gilmer
    Mathilde Caron
    Rodolphe Jenatton
    Lucas Beyer
    Michael Tschannen
    Anurag Arnab
    Carlos Riquelme
    Gamaleldin Elsayed
    Fisher Yu
    Avital Oliver
    Fantine Huot
    Mark Collier
    Vighnesh Birodkar
    Yi Tay
    Alexander Kolesnikov
    Filip Pavetić
    Thomas Kipf
    Xiaohua Zhai
    Neil Houlsby
    Arxiv (2023)
    Preview abstract The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modeling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters. We present a recipe for highly efficient training of a 22B-parameter ViT and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features) ViT22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between bias and performance, an improved alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT22B demonstrates the potential for "LLM-like'' scaling in vision, and provides key steps towards getting there. View details
    Exploring through Random Curiosity with General Value Functions
    Aditya Ramesh
    Louis Kirsch
    Jürgen Schmidhuber
    Advances in Neural Information Processing Systems (2022), pp. 18733-18748
    Preview abstract Efficient exploration in reinforcement learning is a challenging problem commonly addressed through intrinsic rewards. Recent prominent approaches are based on state novelty or variants of artificial curiosity. However, directly applying them to partially observable environments can be ineffective and lead to premature dissipation of intrinsic rewards. Here we propose random curiosity with general value functions (RC-GVF), a novel intrinsic reward function that draws upon connections between these distinct approaches. Instead of using only the current observation’s novelty or a curiosity bonus for failing to predict precise environment dynamics, RC-GVF derives intrinsic rewards through predicting temporally extended general value functions. We demonstrate that this improves exploration in a hard-exploration diabolical lock problem. Furthermore, RC-GVF significantly outperforms previous methods in the absence of ground-truth episodic counts in the partially observable MiniGrid environments. Panoramic observations on MiniGrid further boost RC GVF’s performance such that it is competitive to baselines exploiting privileged information in form of episodic counts. View details
    Object Scene Representation Transformer
    Filip Pavetić
    Leonidas Guibas
    Klaus Greff
    Thomas Kipf
    Advances in Neural Information Processing Systems (2022), pp. 9512-9524
    Preview abstract A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition. Facilitating the learning of such a representation in neural networks holds promise for substantially improving labeled data efficiency. As a key step in this direction, we make progress on the problem of learning 3D-consistent decompositions of complex scenes into individual objects in an unsupervised fashion. We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis. OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods. At the same time, it is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder. View details
    SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos
    Gamaleldin Fathy Elsayed
    Klaus Greff
    Michael Mozer
    Thomas Kipf
    Advances in Neural Information Processing Systems (2022), pp. 28940-28954
    Preview abstract The visual world can be parsimoniously characterized in terms of distinct entities with sparse interactions. Discovering this compositional structure in dynamic visual scenes has proven challenging for end-to-end computer vision approaches unless explicit instance-level supervision is provided. Slot-based models leveraging motion cues have recently shown great promise in learning to represent, segment, and track objects without direct supervision, but they still fail to scale to complex real-world multi-object videos. In an effort to bridge this gap, we take inspiration from human development and hypothesize that information about scene geometry in the form of depth signals can facilitate object-centric learning. We introduce SAVi++, an object-centric video model which is trained to predict depth signals from a slot-based video representation. By further leveraging best practices for model scaling, we are able to train SAVi++ to segment complex dynamic scenes recorded with moving cameras, containing both static and moving objects of diverse appearance on naturalistic backgrounds, without the need for segmentation supervision. Finally, we demonstrate that by using sparse depth signals obtained from LiDAR, SAVi++ is able to learn emergent object segmentation and tracking from videos in the real-world Waymo Open dataset. Project page: https://slot-attention-video.github.io/savi++/ View details
    Investigating object compositionality in Generative Adversarial Networks
    Karol Kurach
    Jürgen Schmidhuber
    Sylvain Gelly
    Neural Networks, 130 (2020), pp. 309-325
    Preview abstract Deep generative models seek to recover the process with which the observed data was generated. They may be used to synthesize new samples or to subsequently extract representations. Successful approaches in the domain of images are driven by several core inductive biases. However, a bias to account for the compositional way in which humans structure a visual scene in terms of objects has frequently been overlooked. In this work, we investigate object compositionality as an inductive bias for Generative Adversarial Networks (GANs). We present a minimal modification of a standard generator to incorporate this inductive bias and find that it reliably learns to generate images as compositions of objects. Using this general design as a backbone, we then propose two useful extensions to incorporate dependencies among objects and background. We extensively evaluate our approach on several multi-object image datasets and highlight the merits of incorporating structure for representation learning purposes. In particular, we find that our structured GANs are better at generating multi-object images that are more faithful to the reference distribution. More so, we demonstrate how, by leveraging the structure of the learned generative process, one can 'invert' the learned generative model to perform unsupervised instance segmentation. On the challenging CLEVR dataset, it is shown how our approach is able to improve over other recent purely unsupervised object-centric approaches to image generation. View details