The Reasonable Effectiveness of Diverse Evaluation Data

Lora Aroyo
Christopher Homan
Alex Taylor
Human Evaluation for Generative Models (HEGM) Workshop at NeurIPS 2022


In this paper, we present findings from a semi-experimental exploration of rater diversity and its influence on safety annotations of conversations generated by humans talking to a generative AI chatbot. We find significant differences in judgments produced by raters from different geographic regions and annotation platforms, and we correlate these perspectives with demographic sub-groups. Our work helps define best practices in model development, specifically human evaluation of generative models, against the backdrop of growing work on sociotechnical AI evaluations.