Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research
Abstract
Datasets form the backbone of machine learning research. They are deeply integrated into the work practices of machine learning researchers, operating as resources for training and testing machine learning models. Moreover, datasets serve a central role in the organization of machine learning as a scientific field. Benchmark datasets formalize tasks and coordinate scientists around shared research problems. Advancement on these benchmarks is considered a key signal of collective progress, and is thus also an important form of social capital used to motivate and evaluate individual researchers. Given their organizing role, datasets have also become a central object of critical inquiry in recent years. For example, dataset audits have revealed pervasive biases, studies of disciplinary norms have revealed concerning practices in dataset development and dissemination, and a host of concerns about benchmarking practices have emerged, calling into question the validity of measurements. However, comparatively little attention has been paid to the dynamics of dataset use within and across machine learning subcommunities. In this work we dig into these dynamics by studying how dataset usage patterns differ across machine learning subcommunities and across time from 2014 to 2021.