To build trust that a machine-learned model is appropriate and responsible within a systems context involving technical and human components, a broad range of factors typically needs to be considered. In practice, however, model evaluations frequently focus on only a narrow range of expected predictive behaviours. This paper examines the critical evaluation gap between the idealized breadth of concerns and the observed narrow focus of actual evaluations. In doing so, we demonstrate which values are centered, and which are marginalized, within the machine learning community. Through an empirical study of machine learning papers from recent high-profile conferences, we demonstrate the discipline’s general focus on a small set of evaluation methods. By considering the mathematical formulations of evaluation metrics and the test datasets over which they are calculated, we draw attention to which properties of models are centered in the field. This analysis also reveals an important gap: the properties of models that are frequently neglected or sidelined during evaluation. By studying the structure of this gap, we demonstrate the machine learning discipline’s implicit assumption of a range of commitments that have normative impacts; these include commitments to consequentialism, abstractability from context, the quantifiability of impacts, the irrelevance of non-predictive features, and the equivalence of different failure modes. Shedding light on these assumptions and commitments enables us to question their appropriateness for different ML system contexts, and points the way towards more diverse and contextualized evaluation methodologies that can be used to examine the trustworthiness of ML models more robustly.