
The Reasonable Effectiveness of Diverse Evaluation Data

Christopher Homan
Alex Taylor
Human Evaluation for Generative Models (HEGM) Workshop at NeurIPS 2022

Abstract

In this paper, we present findings from a semi-experimental exploration of rater diversity and its influence on safety annotations of conversations generated by humans talking to a generative AI chatbot. We find significant differences in judgments produced by raters from different geographic regions and annotation platforms, and correlate these perspectives with demographic sub-groups. Our work helps define best practices in model development, specifically human evaluation of generative models, against the backdrop of growing work on sociotechnical AI evaluations.
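
As a hypothetical illustration of the kind of comparison described above (not the authors' actual analysis), the sketch below tests whether safety-annotation rates differ between two rater regions using a chi-square contingency test; the region labels, counts, and the use of `scipy` are all assumptions made for the example.

```python
# Hypothetical sketch: do safety judgments differ by rater region?
# The counts below are invented for illustration; they are not the paper's data.
from scipy.stats import chi2_contingency

# Rows: rater regions; columns: (# conversations rated unsafe, # rated safe).
contingency = [
    [120, 380],  # region A (assumed counts)
    [200, 300],  # region B (assumed counts)
]

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
# A small p-value would suggest region-level differences in safety judgments.
```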