k-Rater Reliability: The Correct Unit of Reliability for Aggregated Human Annotations

Ka Leung Wong; Praveen Kumar Paritosh

ACL (2022)

Abstract

Since the inception of crowdsourcing, aggregation has been a common strategy for dealing with unreliable data, because aggregate ratings are more reliable than individual ones. However, many natural language processing (NLP) applications that rely on aggregate ratings report only the reliability of individual ratings, which is the wrong unit of analysis. In these cases the reliability of the data is under-reported. We propose k-rater reliability (kRR), a multi-rater generalization of inter-rater reliability (IRR), as the correct measure of reliability for aggregated datasets. We conducted two replications of the WordSim-353 benchmark and present empirical, analytical, and bootstrap-based methods for computing kRR on it. The three methods produce very similar results. We hope this discussion will nudge researchers to report kRR in addition to IRR.
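To make concrete the claim that aggregate ratings are more reliable than individual ones, here is a minimal sketch using the Spearman-Brown prediction formula, a standard psychometric relationship between the reliability of one rating and the reliability of the mean of k ratings. The abstract does not say which analytical method the authors use, so treating Spearman-Brown as that method is an assumption. The function name `spearman_brown` and the input value 0.5 are hypothetical and are not taken from the paper.

```python
def spearman_brown(single_rater_reliability: float, k: int) -> float:
    """Predict the reliability of the mean of k ratings.

    Assumption: the Spearman-Brown formula stands in for an analytical
    kRR estimate; the abstract does not confirm this choice.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    r = single_rater_reliability
    return k * r / (1 + (k - 1) * r)


if __name__ == "__main__":
    # Hypothetical single-rater reliability, chosen only for illustration.
    irr = 0.5
    for k in (1, 5, 13):
        print(f"k={k}: predicted reliability {spearman_brown(irr, k):.3f}")
```

With the illustrative input of 0.5, the formula returns 0.5 for k = 1 and about 0.83 for k = 5. This shows how reporting only the single-rater figure can understate the reliability of a dataset built from averaged ratings.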

Research Areas

Natural language processing
