Adversarial Test Set for Image Classification: Lessons Learned from CATS4ML Data Challenge

Lora Mois Aroyo
Praveen Kumar Paritosh
NeurIPS 2021 Datasets and Benchmarks Track, NeurIPS (2021)

A primary role of data in ML is to serve as benchmarks that allow us to measure progress. Items that are difficult and carry the natural ambiguity of real-world context are often underrepresented in evaluation datasets and benchmarks. The absence of such ambiguous real-world examples undermines the ability to reliably test machine learning performance and leaves unknown unknowns in an ML model’s behaviour, which pose a large risk in deployment. We designed and ran a public data challenge to proactively discover unknown unknowns in state-of-the-art image classification models applied to the Open Images v6 dataset. In this paper, we describe the design and implementation of the AAAI HCOMP CATS4ML 2020 challenge, in which participants were incentivized to find images that the ML models classify incorrectly. We present a set of failure modes of state-of-the-art image classification, abstracted from the 13,000 submissions to this challenge, along with a black-swan benchmark test set based on the challenge.