Mitigating Unwanted Biases with Adversarial Learning
Abstract
Machine learning can be used to train a model that accurately represents
the data on which it is trained. The loss functions most commonly
minimized by gradient descent are measures of accuracy. However,
modeling the training data optimally requires accurately modeling
any undesirable biases present in that training data. One setting
that easily demonstrates this phenomenon is word embeddings
learned from standard corpora. When such word embeddings are
used to perform tasks like analogy completion, the bias in the word
embeddings propagates to the predicted analogy completions. Ideally,
we would like to remove the biased information without impacting
task performance, retaining as much other semantic information as
possible. We present here a method for debiasing
networks using an adversary. First we formalize the problem in
terms of the input to our network X, the desired prediction Y, and
the protected variable Z. The
objective then becomes to maximize the primary network’s ability
to predict Y while minimizing the adversary’s ability to use that
prediction to predict Z. When applied to analogy completion, this
method results in embeddings that remain useful for the task while
no longer producing completions that reflect the unwanted bias.
When applied to a classification task such as the one in the UCI
Adult Dataset, it results in a predictive model that maintains
accuracy while ensuring equality of odds.
This method is quite flexible and applies to any problem that can
be expressed as a model predicting a label Y from an input X while
remaining fair with respect to a protected variable Z.
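
To make this objective concrete, one way to write it (the abstract gives no formula, so the notation below, including the predictor f, adversary g, losses L_P and L_A, and trade-off weight alpha, is our own) is as a pair of alternating minimizations:

$$
\min_{\theta_A}\; L_A\big(g_{\theta_A}(f_{\theta_P}(X)),\, Z\big)
\qquad \text{and} \qquad
\min_{\theta_P}\; L_P\big(f_{\theta_P}(X),\, Y\big) \;-\; \alpha\, L_A\big(g_{\theta_A}(f_{\theta_P}(X)),\, Z\big),
$$

where the adversary g sees only the predictor's output, matching the description above: the predictor is rewarded for predicting Y and for making Z unrecoverable from its prediction, while the adversary is trained to recover Z anyway.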
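Below is a minimal sketch of this alternating training loop, assuming a PyTorch-style setup on synthetic data. The model shapes, names (predictor, adversary, alpha), and the data-generating process are illustrative assumptions, not the paper's implementation, and refinements such as conditioning the adversary on Y (to target equality of odds) are omitted.

import torch
import torch.nn as nn

# Synthetic stand-in data (illustrative only): X has 10 features, Y is
# the label to predict, and Z is a binary protected variable that leaks
# into X through its first feature.
torch.manual_seed(0)
X = torch.randn(512, 10)
Z = (X[:, 0] > 0).float().unsqueeze(1)
Y = ((X[:, 1] + 0.5 * X[:, 0]) > 0).float().unsqueeze(1)

# Primary network: predicts a logit for Y from X.
predictor = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
# Adversary: tries to recover Z from the predictor's output alone.
adversary = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 1))

bce = nn.BCEWithLogitsLoss()
opt_p = torch.optim.Adam(predictor.parameters(), lr=1e-2)
opt_a = torch.optim.Adam(adversary.parameters(), lr=1e-2)
alpha = 1.0  # trade-off weight; a free hyperparameter in this sketch

for step in range(500):
    # (1) Adversary step: learn to predict Z from the predictor's output.
    # detach() keeps this update from flowing back into the predictor.
    y_logit = predictor(X)
    loss_a = bce(adversary(y_logit.detach()), Z)
    opt_a.zero_grad()
    loss_a.backward()
    opt_a.step()

    # (2) Predictor step: predict Y well while making the adversary's
    # task hard, i.e. minimize L_P - alpha * L_A.
    y_logit = predictor(X)
    loss_p = bce(y_logit, Y) - alpha * bce(adversary(y_logit), Z)
    opt_p.zero_grad()
    loss_p.backward()
    opt_p.step()

Note that only the predictor's parameters are updated in step (2), so the -alpha * L_A term pushes the predictor's output to carry less information about Z rather than simply corrupting the adversary.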