Data Excellence for AI: Why Should You Care

Lora Mois Aroyo
Matt Lease
Praveen Kumar Paritosh
ACM IX Interactions (2022)

The efficacy of machine learning (ML) models depends on both algorithms and data. Training data defines what we want our models to learn, and testing data provides the means by which empirical progress is measured. Benchmark datasets such as SQuAD, GLUE, and ImageNet define the entire world within which models exist and operate, yet research continues to focus on critiquing and improving the models, e.g., via shared-task challenges or Kaggle contests, rather than critiquing and improving the data environment in which our models operate. Research and community challenges focused on improving the data itself are relatively rare. If “data is the new oil,” our use of data remains crude today, and we are missing work on the refineries by which the data itself could be optimized for more effective use. Important scientific opportunities and value are being neglected [Schaekermann et al., 2020]. Data is potentially the most undervalued and de-glamorised aspect of today’s AI ecosystem. Data issues are often perceived and characterized as mundane and rote, the “pre-processing” that has to be done before the real (modeling) work can begin. For example, Kandel et al. (2012) emphasize that ML practitioners view data wrangling as tedious and time-consuming. However, Sambasivan et al. (2021) provide examples of how data quality is crucial to ensuring that AI systems accurately represent and predict the phenomena they claim to measure. They introduce four classes of Data Cascades: compounding events that cause negative downstream effects from data issues and are triggered by conventional AI/ML practices that undervalue data quality. This underscores the significance of data, given its downstream impact on user wellbeing and society. Real-world datasets are often ‘dirty’, with various data quality problems (Northcutt et al., 2021), which creates a risk of “garbage in, garbage out” for the downstream AI systems we train and test on such data.
This has inspired a steadily growing body of work on understanding and improving data quality (Chu et al., 2013; Krishnan et al., 2016; Redman et al., 2018; Raman et al., 2001). It also highlights the importance of rigorously managing data quality using mechanisms specific to data validation, instead of relying on model performance as a proxy for data quality (Thomas et al., 2020). Just as we rigorously test our code for software defects before deployment, we might test for data defects with the same degree of rigor, so that we can detect, prevent, or mitigate weaknesses in ML models caused by underlying issues in data quality. The “Crowdsourcing Adverse Test Sets for Machine Learning (CATS4ML)” Data Challenge (Aroyo and Paritosh, 2021) aims to raise the bar in ML evaluation sets and to find as many examples as possible that are confusing or otherwise problematic for algorithms to process. Similarly to Vandenhof (2019), CATS4ML relies on people’s abilities and intuition to spot new data examples that a machine learning model classifies confidently but incorrectly. This research is inspired by Attenberg et al. (2015) and follows the claim by Ipeirotis (2016) that “Humans should always be part of machine learning solutions, as they can guide machine learning systems to learn about things that the systems don't yet know, the ‘unknown unknowns.’” Many benchmark datasets contain instances that are relatively easy (e.g., photos with a subject that is easy to identify). In so doing, they miss the natural ambiguity of the real world in which our models are actually to be applied. Data instances with annotator disagreement are often aggregated to eliminate disagreement (obscuring uncertainty), or filtered out of datasets entirely. Excluding difficult and/or ambiguous real-world examples from evaluation risks producing “toy dataset” benchmarks that diverge from the real data to be encountered in practice.
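To make the point about aggregation concrete, the following is a minimal Python sketch, not a method taken from the cited work. It contrasts a majority-vote label with the full label distribution and a simple disagreement score. The item identifiers, the labels, and the helper names (`majority_label`, `label_distribution`, `normalized_entropy`) are all hypothetical, and the entropy is normalized by the number of labels actually observed for an item, which is only one of several reasonable choices.

```python
from collections import Counter
from math import log2


def majority_label(labels):
    """Return the most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def label_distribution(labels):
    """Return the fraction of annotators who chose each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def normalized_entropy(labels):
    """Disagreement score in [0, 1]: 0 is unanimous, 1 is an even split.

    Assumption: normalizes by the number of distinct labels observed
    for this item, not by the size of the full label vocabulary.
    """
    dist = label_distribution(labels)
    if len(dist) < 2:
        return 0.0
    entropy = -sum(p * log2(p) for p in dist.values())
    return entropy / log2(len(dist))


# Hypothetical annotations: item id -> labels from five annotators.
annotations = {
    "img_001": ["cat", "cat", "cat", "cat", "cat"],
    "img_002": ["cat", "dog", "cat", "dog", "cat"],
}

for item, labels in annotations.items():
    print(
        item,
        majority_label(labels),
        label_distribution(labels),
        round(normalized_entropy(labels), 2),
    )
```

Both hypothetical items receive the same majority label, yet the second has a 60/40 split among annotators. Keeping the distribution or the disagreement score alongside the aggregated label preserves the uncertainty that majority voting alone would hide.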
Successful benchmark models fail to generalize to real data, and inflated benchmark results mislead our assessment of state-of-the-art capabilities. ML models become prone to develop “weak spots”, i.e., classes of examples that are difficult or impossible for a model to accurately evaluate, because that class of examples is missing from the evaluation set. Measuring data quality is challenging, nebulous, and often circularly defined, with annotated data defining the “ground truth” on which models are trained and tested [Riezler, 2014]. When dataset quality is considered, the ways in which it is measured in practice are often poorly understood and sometimes simply wrong. Challenges identified include fairness and bias issues in labeled datasets [Goel and Faltings, 2019], quality issues in datasets [Crawford and Paglen, 2019], limitations of benchmarks [Kovaleva et al., 2019, Welty et al., 2019], reproducibility concerns in machine learning research [Pineau et al., 2018, Gundersen and Kjensmo, 2018], and lack of documentation and replication of data [Katsuno et al., 2019]. Measurement of AI success and progress today is often metrics-driven, with emphasis on rigorous measurement and A/B testing. However, measuring the goodness of fit of the model to the dataset completely ignores how well the dataset fits the real-world problem to be solved and its data. Goodness-of-fit metrics, such as F1, accuracy, and AUC, do not tell us much about data fidelity (i.e., how well the dataset represents reality) and validity (how well the data explains things related to the phenomena captured by the data). No standardised metrics exist today for characterising the goodness-of-data [11,13]. Research on metrics is emerging [15,91] but is not yet widely known, accepted, or applied in the AI ecosystem today. As a result, there is an overreliance on goodness-of-fit metrics and post-deployment product metrics.
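The gap between goodness-of-fit and data fidelity can be illustrated with a toy Python sketch. Everything here is hypothetical and chosen only for illustration: the `brightness_model` classifier, the single `brightness` feature, and both example sets. The benchmark contains only easy examples, while the field sample adds the ambiguous cases that the benchmark leaves out.

```python
def accuracy(model, examples):
    """Fraction of (features, label) pairs the model predicts correctly."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)


def brightness_model(features):
    """Toy classifier: predicts "day" when brightness exceeds 0.5."""
    return "day" if features["brightness"] > 0.5 else "night"


# Hypothetical benchmark: only clear-cut, easy examples.
benchmark = [
    ({"brightness": 0.9}, "day"),
    ({"brightness": 0.8}, "day"),
    ({"brightness": 0.1}, "night"),
    ({"brightness": 0.2}, "night"),
]

# Hypothetical field sample: the same easy cases plus ambiguous ones
# that the benchmark excludes.
field_sample = benchmark + [
    ({"brightness": 0.7}, "night"),  # e.g., a floodlit scene at night
    ({"brightness": 0.6}, "night"),
    ({"brightness": 0.3}, "day"),    # e.g., a heavy storm at midday
    ({"brightness": 0.4}, "day"),
]

print("benchmark accuracy:", accuracy(brightness_model, benchmark))
print("field accuracy:", accuracy(brightness_model, field_sample))
```

In this sketch the model fits the benchmark perfectly but is right on only half of the field sample. The benchmark score is an honest measure of fit to the benchmark; it simply says nothing about the class of examples the benchmark never contained.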
Focusing on fidelity and validity of data will further increase its scientific value and reusability. Such research is necessary for enabling better incentives for data, as it is hard to improve something we cannot measure. Researchers in human computation (HCOMP) and various ML-related fields have demonstrated a longstanding interest in applying crowdsourcing approaches to generate human-annotated data for model training and testing [25,128]. A series of workshops (Meta-Eval 2020 @ AAAI, REAIS 2019 @ HCOMP, SAD 2019 @ TheWebConf (WWW), SAD 2018 @ HCOMP) has helped further increase awareness of data quality issues in ML evaluation and has provided a venue for scholarship on this subject. Because human-annotated data represents the compass that the entire ML community relies on, data-focused research, by the HCOMP community and others, can potentially have a multiplicative effect on accelerating progress in ML more broadly. Optimizing the cost, size, and speed of collecting data has attracted significant attention in the first-to-market rush with data. However, the maintainability, reliability, validity, and fidelity of datasets have often been overlooked. We argue that the field of ML has now reached an inflection point at which attention to neglected data quality is poised to significantly accelerate progress. Toward this end, we advocate for research defining and creating processes to achieve data excellence. We highlight examples, case studies, and methodologies. This will enable the necessary change in our research culture to value excellence in data practices, which is a critical milestone on the road to enabling the next generation of breakthroughs in ML and AI.