Croissant: A Metadata Format for ML-Ready Datasets

Mubashara Akhtar
Omar Benjelloun
Costanza Conforti
Luca Foschini
Joan Giner-Miguelez
Pieter Gijsbers
Sujata Goswami
Nitisha Jain
Michalis Karamousadakis
Michael Kuchnik
Satyapriya Krishna
Sylvain Lesage
Quentin Lhoest
Pierre Marcenac
Manil Maskey
Peter Mattson
Luis Oala
Hamidah Oderinwale
Pierre Ruyssen
Tim Santos
Rajat Shinde
Elena Simperl
Arjun Suresh
Goeff Thomas
Vyacheslav Tykhonov
Joaquin Vanschoren
Susheel Varma
Jos Van Der Velde
Carole Jean Wu
Steffen Vogler
Luyao Zhang
2024

Abstract

Data is a critical resource for machine learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that creates a shared representation across ML tools, frameworks, and platforms. Croissant makes datasets more discoverable, portable, and interoperable, thereby addressing significant challenges in ML data management. Croissant is already supported by several popular dataset repositories spanning hundreds of thousands of datasets, which can be loaded easily into the most commonly used ML frameworks regardless of where the data is stored. Our initial evaluation by human raters shows that Croissant metadata is readable, understandable, and complete, yet concise.