Toward Deconfounding the Influence of Entity Demographics for Question Answering Accuracy
Abstract
Question Answering (QA) tasks are used as benchmarks of general machine intelligence. Therefore, robust QA evaluation is critical, and metrics should indicate how models will answer _any_ question. However, major QA datasets have skewed distributions over gender, profession, and nationality. Despite this skew, models generalize: we find little evidence that accuracy is lower for people based on gender or nationality. Instead, accuracy varies more with question topic and question ambiguity. Adequately assessing the generalization of QA systems requires more representative datasets.