Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research

Bernard Koch; Emily Denton; Alex Hanna; Jacob Foster

NeurIPS Datasets and Benchmarks Track (2021)

Abstract

Datasets form the backbone of machine learning research. They are deeply integrated into the work practices of machine learning researchers, operating as resources for training and testing machine learning models. Moreover, datasets play a central role in the organization of machine learning as a scientific field. Benchmark datasets formalize tasks and coordinate scientists around shared research problems. Advancement on these benchmarks is considered a key signal of collective progress, and is thus also an important form of social capital used to motivate and evaluate individual researchers. Given their central organizing role, datasets have also become a central object of critical inquiry in recent years. For example, dataset audits have revealed pervasive biases; studies of disciplinary norms have revealed concerning practices in dataset development and dissemination; and a host of concerns about benchmarking practices have emerged, calling into question the validity of measurements. However, comparatively little attention has been paid to the dynamics of dataset use within and across machine learning subcommunities. In this work we dig into these dynamics by studying how dataset usage patterns differ across machine learning subcommunities and over time, from 2014 to 2021.

Research Areas

Responsible AI
