Follow the leader(board) with confidence: Estimating p-values from a single test set with item and response variance
Abstract
We tackle the problem of providing accurate, rigorous p-values
for comparisons between the results of two evaluated systems
whose evaluations are based on a crowdsourced “gold” reference
standard. While this problem has been studied before, we argue
that the null hypotheses used in previous work have been based
on a common fallacy of equality of probabilities, as opposed to the
standard null hypothesis that two sets are drawn from the same
distribution. We propose using the standard null hypothesis, that
two systems’ responses are drawn from the same distribution, and
introduce a simulation-based framework for determining the true
p-value for this null hypothesis. We explore how to estimate the
true p-value from a single test set under different metrics, tests,
and sampling methods, and call particular attention to the role of
response variance, which exists in crowdsourced annotations as a
product of genuine disagreement, in system predictions as a
product of stochastic training regimes, and in generative models as
an expected property of the outputs. We find that response variance
is a powerful tool for estimating p-values, and present results for
the metrics, tests, and sampling methods that yield the best p-value
estimates in a simple machine learning model comparison.
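To make the null hypothesis concrete, below is a minimal, generic sketch of a paired permutation test for the hypothesis that two systems' per-item scores are drawn from the same distribution. This is an illustrative example only, not the simulation-based framework introduced in the paper: the function name `permutation_p_value` and the toy 0/1 correctness vectors are our own, and the sketch ignores the item- and response-variance modeling that the abstract highlights.

```python
import numpy as np

def permutation_p_value(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided paired permutation test for the null hypothesis that
    systems A and B's per-item scores come from the same distribution."""
    rng = np.random.default_rng(seed)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    observed = abs(scores_a.mean() - scores_b.mean())

    count = 0
    for _ in range(n_permutations):
        # Under the null, the system labels are exchangeable, so we
        # randomly swap the two systems' scores on each test item.
        swap = rng.random(scores_a.shape[0]) < 0.5
        perm_a = np.where(swap, scores_b, scores_a)
        perm_b = np.where(swap, scores_a, scores_b)
        if abs(perm_a.mean() - perm_b.mean()) >= observed:
            count += 1
    # Add-one smoothing keeps the estimated p-value away from exactly zero.
    return (count + 1) / (n_permutations + 1)

# Example: per-item correctness (0/1) of two systems on the same test set.
a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
b = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0])
print(permutation_p_value(a, b))
```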