Croissant: A Metadata Format for ML-Ready Datasets

Mubashara Akhtar; Omar Benjelloun; Costanza Conforti; Luca Foschini; Joan Giner-Miguelez; Pieter Gijsbers; Sujata Goswami; Nitisha Jain; Michalis Karamousadakis; Michael Kuchnik; Satyapriya Krishna; Sylvain Lesage; Quentin Lhoest; Pierre Marcenac; Manil Maskey; Peter Mattson; Luis Oala; Hamidah Oderinwale; Pierre Ruyssen; Tim Santos; Rajat Shinde; Elena Simperl; Arjun Suresh; Goeff Thomas; Vyacheslav Tykhonov; Joaquin Vanschoren; Susheel Varma; Jos Van Der Velde; Carole Jean Wu; Steffen Vogler; Luyao Zhang

2024

Abstract

Data is a critical resource for machine learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that creates a shared representation across ML tools, frameworks, and platforms. Croissant makes datasets more discoverable, portable, and interoperable, thereby addressing significant challenges in ML data management. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, which can be loaded easily into the most commonly used ML frameworks regardless of where the data is stored. Our initial evaluation by human raters shows that Croissant metadata is readable, understandable, and complete, yet concise.
