Jump to Content

Kubric: A scalable dataset generator

Anissa Yuenming Mak
Austin Stone
Carl Doersch
Cengiz Oztireli
Charles Herrmann
Daniel Rebain
Derek Nowrouzezahrai
Dmitry Lagun
Fangcheng Zhong
Florian Golemo
Francois Belletti
Henning Meyer
Hsueh-Ti (Derek) Liu
Issam Laradji
Klaus Greff
Kwang Moo Yi
Matan Sela
Noha Radwan
Thomas Kipf
Tianhao Wu
Vincent Sitzmann
Yilun Du
Yishu Miao
(2022)
Google Scholar

Abstract

Data is the driving force of machine learning. The amount and quality of training data is often more important for the performance of a system than the details of its architecture. Data is also an important tool for testing specific hypothesis, and for empirically evaluating the behaviour of complex systems. Synthetic data generation represents a powerful tool that can address all these shortcomings: 1) it is cheap 2) supports rich ground-truth annotations 3) offers full control over data and 4) can circumvent privacy and legal concerns. Unfortunately the toolchain for generating data is less well developed than that for building models. We aim to improve this situation by introducing Kubric: a scalable open-source pipeline for generating realistic image and video data with rich ground truth annotations. We also publish a collection of generated datasets and baseline results on several vision tasks.

Research Areas