Capturing Covertly Toxic Speech via Crowdsourcing
Abstract
We study the task of obtaining labels for covert, or veiled, toxicity in user comments. Prior research has highlighted the difficulty of building language models that recognize nuanced toxicity such as microaggressions. Our investigations further underscore the difficulty of eliciting such labels reliably from raters via crowdsourcing. We introduce an initial dataset, COVERTTOXICITY, which aims to identify such comments using a refined rater template, with rater-associated categories. Finally, we fine-tune a comment-domain BERT model to classify covertly offensive comments and compare it against existing baselines.