"Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI

Nithya Sambasivan; Shivani Kapania; Hannah Highfill; Diana Akrong; Praveen Kumar Paritosh; Lora Mois Aroyo

SIGCHI, ACM (2021)

Abstract

AI models are increasingly applied in high-stakes domains like health and conservation. Data quality carries elevated significance in high-stakes AI because of its heightened downstream impact on predictions such as cancer detection, wildlife poaching, and loan allocations. Paradoxically, data is the most under-valued and de-glamorised aspect of AI. In this paper, we report on data practices in high-stakes AI, drawing on interviews with 53 AI practitioners in India, East and West African countries, and the USA. We define, identify, and present empirical evidence on Data Cascades (compounding events causing negative, downstream effects from data issues), triggered by conventional AI/ML practices that undervalue data quality. Data cascades are pervasive (92% prevalence), invisible, delayed, but often avoidable. We discuss HCI opportunities in designing and incentivizing data excellence as a first-class citizen of AI, resulting in safer and more robust systems for all.
