On the limits of evaluation
Abstract
As Generative AI (GenAI) systems increasingly enter our daily lives, reshaping social norms and practices, we must examine the norms and practices we use to evaluate the systems themselves. Recent scholarship has begun to make explicit the normative dimensions of Machine Learning (ML) development and evaluation. \citet{birhane2022values} demonstrate that particular normative values are encoded in ML practice. \citet{hutchinson2022evaluation}, in a review of ML evaluation practices, identify several commitments implicit in the way ML models are evaluated: a commitment to consequentialism; the assumptions that evaluations can be undertaken acontextually and that model inputs need only play a limited role during model evaluation; and the expectations that impacts can be quantified and that ML failure modes are commensurable. In this provocation, we extend this line of inquiry by arguing two points: first, that we need to attend to the implicit assumptions and values reflected in how societal impacts are conceptualised and constructed through ML evaluations; and second, that doing so reveals that many of the problems societal impact evaluations attempt to address would be better conceptualised as governance issues rather than evaluation issues.