Object Scene Representation Transformer

Mehdi S. M. Sajjadi; Daniel Duckworth; Aravindh Mahendran; Sjoerd van Steenkiste; Filip Pavetić; Mario Lučić; Leonidas Guibas; Klaus Greff; Thomas Kipf

Object Scene Representation Transformer

Mehdi S. M. Sajjadi

Daniel Duckworth

Aravindh Mahendran

Sjoerd van Steenkiste

Filip Pavetić

Mario Lučić

Leonidas Guibas

Klaus Greff

Thomas Kipf

Advances in Neural Information Processing Systems (2022), pp. 9512-9524

Download Google Scholar

Abstract

A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition. Facilitating the learning of such a representation in neural networks holds promise for substantially improving labeled data efficiency. As a key step in this direction, we make progress on the problem of learning 3D-consistent decompositions of complex scenes into individual objects in an unsupervised fashion. We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis. OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods. At the same time, it is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder.

Research Areas

Machine perception

Explore our many areas of focus

Building a collaborative ecosystem

Shaping the future together

Translating discovery into real-world impact

Object Scene Representation Transformer

Abstract

Research Areas

Meet the teams driving innovation

Google AI

Google Cloud

Google DeepMind

Google Labs