Interpreting Social Respect: A Normative Lens for ML Models
Abstract
Machine learning is often viewed as an inherently value-neutral process:
statistical tendencies in the training inputs are ``simply''
used to generalize to new examples. However, when models impact social
systems such as interactions between humans, the patterns they learn
have normative implications. It is important that we ask not only ``what
patterns exist in the data?'', but also ``how do we want our system
to impact people?'' In particular, because minority and marginalized
members of society are often statistically underrepresented in data sets, models
may have undesirable disparate impact on such groups. As such, objectives of
social equity and distributive justice require that we develop tools for both
identifying and interpreting harms introduced by models.
This paper directly addresses the challenge of interpreting how human
values are implicitly encoded by deep neural networks, a machine learning
paradigm often seen as inscrutable. Doing so requires understanding how the node
activations of neural networks relate to value-laden human concepts
such as {\sc respectful} and {\sc abusive}, as well as to concepts
about human social identities such as {\sc gay}, {\sc straight},
{\sc male}, and {\sc female}. To this end, we present the first application
of Testing with Concept Activation Vectors ({\sc tcav}; \cite{kim2018interpretability})
to models for analyzing human language.
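As a concrete illustration of the technique, the sketch below outlines the core
{\sc tcav} computation: a concept activation vector ({\sc cav}) is the unit
normal of a linear boundary separating layer activations of concept examples
from random counterexamples, and the {\sc tcav} score is the fraction of inputs
whose class logit has a positive directional derivative along that vector. This
is a minimal sketch for exposition, not the full experimental pipeline; the
helpers \texttt{layer\_activations} and \texttt{logit\_grad\_wrt\_layer} are
hypothetical stand-ins for model-specific code.

\begin{verbatim}
# A minimal sketch of the core TCAV computation, assuming two
# hypothetical model-specific helpers: layer_activations(texts),
# returning an (n, d) array of activations at a chosen layer, and
# logit_grad_wrt_layer(texts, class_idx), returning the (n, d)
# gradients of a class logit with respect to those activations.

import numpy as np
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(concept_acts, random_acts):
    """Fit a linear classifier separating concept examples from
    random counterexamples in activation space; the CAV is the
    unit normal of the resulting decision boundary."""
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)),
                        np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_[0]
    return cav / np.linalg.norm(cav)

def tcav_score(grads, cav):
    """Fraction of inputs whose class logit increases when the
    activations move in the CAV direction, i.e. whose directional
    derivative along the CAV is positive."""
    return float(np.mean(grads @ cav > 0))
\end{verbatim}

For example, a {\sc cav} for {\sc respectful} would be fit on activations of
respectful versus random sentences; the resulting score then summarizes how
consistently moving along that concept direction raises a chosen output logit
across a set of test sentences.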