Interpreting Social Respect: A Normative Lens for ML Models

KJ Pittl
M. Mitchell
(2019)

Abstract

Machine learning is often viewed as an inherently value-neutral process: statistical tendencies in the training inputs are "simply" used to generalize to new examples. However, when models impact social systems such as interactions between humans, the patterns learned by models have normative implications. It is important that we ask not only "what patterns exist in the data?", but also "how do we want our system to impact people?" In particular, because minority and marginalized members of society are often statistically underrepresented in data sets, models may have an undesirable disparate impact on these groups. Objectives of social equity and distributive justice therefore require that we develop tools for both identifying and interpreting the harms introduced by models. This paper directly addresses the challenge of interpreting how human values are implicitly encoded by deep neural networks, a machine learning paradigm often seen as inscrutable. Doing so requires understanding how the node activations of neural networks relate to value-laden human concepts such as RESPECTFUL and ABUSIVE, as well as to concepts about human social identities such as GAY, STRAIGHT, MALE, and FEMALE. To do this, we present the first application of Testing with Concept Activation Vectors (TCAV; Kim et al., 2018) to models for analyzing human language.
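The abstract's core mechanism can be illustrated with a minimal sketch of TCAV: learn a concept activation vector (CAV) that separates a layer's activations on concept examples from those on random examples, then measure how often moving activations in that direction increases a model's prediction. This sketch uses synthetic activations and a toy two-layer network; the difference-of-means CAV is a simplification of the linear classifier used in the original method, and all names and dimensions here are illustrative assumptions, not the paper's actual setup.

```python
# Minimal TCAV sketch with synthetic data (NumPy only).
# A real application would use activations from a trained network layer,
# e.g. on text labeled with a concept such as RESPECTFUL.
import numpy as np

rng = np.random.default_rng(0)
d = 8  # activation dimensionality (illustrative)

# Synthetic layer activations: "concept" examples are shifted along a
# hidden direction; "random" counterexamples are not.
concept_dir = np.zeros(d)
concept_dir[0] = 1.0
concept_acts = rng.normal(size=(100, d)) + 2.0 * concept_dir
random_acts = rng.normal(size=(100, d))

# CAV: normal vector of a linear separator between the two activation sets.
# A difference of class means stands in for the logistic-regression
# classifier trained in the original method.
cav = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
cav /= np.linalg.norm(cav)

# Toy downstream model: a tiny ReLU network mapping activations to a score.
hidden = 16
W1 = rng.normal(size=(d, hidden))
W2 = rng.normal(size=hidden)

def model_grad(acts):
    """Gradient of the toy network's score w.r.t. the input activations."""
    mask = (acts @ W1 > 0).astype(float)  # ReLU derivative per example
    return (mask * W2) @ W1.T             # chain rule, shape (n, d)

# TCAV score: fraction of inputs whose score increases when activations
# move in the concept direction (positive directional derivative).
inputs = rng.normal(size=(200, d))
directional = model_grad(inputs) @ cav
tcav_score = float((directional > 0).mean())
print(f"TCAV score: {tcav_score:.2f}")
```

A score far from 0.5 (assessed against CAVs learned from random splits, in the full method) suggests the concept systematically influences the model's predictions.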