Data and its (dis)contents: A survey of dataset development and use in machine learning research

Amandalynne Paullada; Inioluwa Deborah Raji; Emily Bender; Emily Denton; Alex Hanna

Patterns (2021)

Abstract

Datasets form the basis for training, evaluating, and benchmarking machine learning models and have played a foundational role in the advancement of the field. Furthermore, the ways in which we collect, construct, and share these datasets inform the kinds of problems the field pursues and the methods explored in algorithm development. In this work, we survey recent issues pertaining to data in machine learning research, focusing primarily on work in computer vision and natural language processing. We summarize concerns relating to the design, collection, maintenance, distribution, and use of machine learning datasets as well as broader disciplinary norms and cultures that pervade the field. We advocate a turn in the culture toward more careful practices of development, maintenance, and distribution of datasets that are attentive to limitations and societal impact while respecting the intellectual property and privacy rights of data creators and data subjects.

Research Areas

Machine intelligence
