Measuring and Improving Model-Moderator Collaboration using Uncertainty Estimation
Abstract
Content moderation is often performed by a collaboration between humans and machine learning models. The machine learning models used in this collaboration are typically evaluated using metrics like accuracy or AUROC. However, such metrics do not capture the performance of the combined model-moderator system. Here, we introduce metrics analogous to accuracy and AUROC that describe the overall system performance under constraints on human review bandwidth, and that quantify how efficiently and effectively these systems make use of human decision-making. We evaluate the performance of several models using these new metrics, as well as existing ones, under different review policies (the order in which moderators review comments surfaced by the model), finding that simple uncertainty-based review policies outperform traditional toxicity-based ones across a range of human bandwidths. Our results demonstrate the importance of metrics that capture the collaborative nature of the model-moderator system for this task, as well as the utility of uncertainty estimation for the content moderation problem.
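To make the two review policies concrete, the following is a minimal sketch, not the paper's implementation: it assumes a binary toxicity classifier that outputs probabilities, uses predictive entropy as the uncertainty measure, and idealizes the moderator as always correct for the comments they review. The function names (`review_policy_order`, `collaborative_predictions`) and the synthetic data are illustrative assumptions.

```python
import numpy as np

def review_policy_order(toxicity_probs, policy="uncertainty"):
    """Rank comments for human review.

    policy='uncertainty' surfaces the comments the model is least sure
    about first; policy='toxicity' surfaces the highest-scoring first.
    """
    p = np.asarray(toxicity_probs, dtype=float)
    if policy == "uncertainty":
        # Binary predictive entropy peaks at p = 0.5 (maximal uncertainty).
        eps = 1e-12
        scores = -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))
    elif policy == "toxicity":
        scores = p
    else:
        raise ValueError(f"unknown policy: {policy}")
    return np.argsort(-scores)  # comment indices, most review-worthy first

def collaborative_predictions(toxicity_probs, true_labels, budget, policy):
    """Simulate the combined system: the top-`budget` comments under the
    policy receive (assumed-correct) human labels; the rest keep the
    model's thresholded decisions."""
    order = review_policy_order(toxicity_probs, policy)
    preds = (np.asarray(toxicity_probs) >= 0.5).astype(int)
    reviewed = order[:budget]
    preds[reviewed] = np.asarray(true_labels)[reviewed]
    return preds

# Toy comparison on synthetic data at a fixed human review bandwidth.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
probs = np.clip(labels * 0.7 + rng.normal(0.15, 0.25, size=1000), 0.0, 1.0)
for policy in ("uncertainty", "toxicity"):
    preds = collaborative_predictions(probs, labels, budget=100, policy=policy)
    print(policy, "combined-system accuracy:", (preds == labels).mean())
```

Sweeping `budget` over a range of values and plotting combined-system accuracy against it yields a curve analogous to the bandwidth-constrained metrics described above.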