Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development

Morgan Klaus Scheuerman
Alex Hanna
The 24th ACM Conference on Computer-Supported Cooperative Work and Social Computing (2021)

Abstract

Data is a crucial component of machine learning; a model relies on data for training, validation, and testing. With increased technical capabilities, machine learning research has boomed in both academic and industry settings, and one major focus has been computer vision. Computer vision is a popular domain of machine learning increasingly pertinent to real-world applications, from facial recognition in policing to object detection for autonomous vehicles. Given computer vision’s propensity to shape machine learning research practices and impact human life, we sought to understand disciplinary practices around dataset documentation: how data is collected, curated, annotated, and packaged into datasets for computer vision researchers and practitioners to use for model tuning and development. Specifically, we examined what dataset documentation communicates about the underlying values of vision data and the larger practices and goals of computer vision as a field. To conduct this study, we collected a large corpus of computer vision datasets, from which we sampled 114 databases across different vision tasks. We document a number of values around accepted data practices, what makes desirable data, and the treatment of humans in the dataset construction process. We discuss how computer vision database authors value efficiency at the expense of care; universality at the expense of contextuality; impartiality at the expense of positionality; and model work at the expense of data work. Many of the silenced values we identified stand in opposition to human-centered data practices, which we reference in our suggestions for better incorporating silenced values into the dataset curation process.