Critical Evaluation Gaps in Machine Learning Practice
Abstract
In order to build trust that a machine-learned model is appropriate and responsible within a systems context involving technical and human components, a broad range of factors typically needs to be considered. In practice, however, model evaluations frequently focus on only a narrow range of expected predictive behaviours. This paper examines the critical evaluation gap between this idealized breadth of concerns and the observed narrow focus of actual evaluations. In doing so, we demonstrate which values are centered, and which are marginalized, within the machine learning community. Through an empirical study of machine learning papers from recent high-profile conferences, we document the discipline's general reliance on a small set of evaluation methods. By considering the mathematical formulations of evaluation metrics and the test datasets over which they are calculated, we draw attention to the properties of models that are centered in the field. This analysis also reveals an important gap: the properties of models that are frequently neglected or sidelined during evaluation. By studying the structure of this gap, we show that the machine learning discipline implicitly assumes a range of commitments with normative impacts, including commitments to consequentialism, abstractability from context, the quantifiability of impacts, the irrelevance of non-predictive features, and the equivalence of different failure modes. Shedding light on these assumptions and commitments enables us to question their appropriateness for different ML system contexts and points the way towards more diverse and contextualized evaluation methodologies that can more robustly examine the trustworthiness of ML models.