Data Excellence: Better Data for Better AI

Lora Mois Aroyo


Human-annotated data plays a crucial role in the current ML/AI climate, where human judgements are treated as the ultimate source of truth. As such, human-annotated data is a kind of compass for AI, and research on Human Computation has a multiplicative effect on the AI field. Optimizing the cost, scale, and speed of data collection has been the central focus of Human Computation research. What is less known is that such optimization is sometimes done at the cost of quality [Riezler, 2014]. Quality is evidently important, but unfortunately it is poorly defined and rarely measured. A decade later, problems inherent to data are beginning to surface: fairness and bias [Goel and Faltings, 2019], quality issues [Crawford and Paglen, 2019], limitations of benchmarks [Kovaleva et al., 2019, Welty et al., 2019], reproducibility in ML research [Pineau et al., 2018, Gunderson and Kjensmo, 2018], and lack of documentation [Katsuno et al., 2019]. Finally, in the rush to be first to market, aspects of data quality such as maintainability, reliability, validity, and fidelity are often overlooked. We want to turn this way of thinking on its head and highlight examples, case studies, and methodologies for excellence in data collection. Currently, data excellence, to the extent that it happens at all, happens organically by virtue of individual expertise, diligence, commitment, pride, etc. This could be dangerous: as we grow increasingly dependent on automation technologies, we don't want to be at the mercy of individual exemplars. Instead, we should codify data excellence in a systematic manner and raise standards across the entire industry.