Mitigating Unwanted Biases with Adversarial Learning

Blake Lemoine
Brian Zhang
M. Mitchell
ACM (2018)


Machine learning can be used to train a model that accurately represents the data on which it is trained, and the loss functions most commonly minimized by gradient descent reward accuracy. However, modeling the training data optimally also means accurately modeling any undesirable biases present in that data. Word embeddings learned from standard corpora demonstrate this phenomenon clearly: when such embeddings are used for tasks like analogy completion, their bias propagates to the predicted completions. Ideally, we would remove the biased information that might affect task performance while retaining as much other semantic information as possible. We present a method for debiasing networks using an adversary. We first formalize the problem in terms of the network’s input X, the desired prediction Y, and the protected variable Z. The objective is then to maximize the primary network’s ability to predict Y while minimizing the adversary’s ability to predict Z from that prediction. Applied to analogy completion, the method yields embeddings that remain useful for the task without producing predictions affected by the bias. Applied to a classification task such as the one on the UCI Adult Dataset, it yields a predictive model that maintains accuracy while ensuring equality of odds. The method is flexible: it applies to any problem that can be expressed as a model predicting a label Y from an input X while trying to be fair with respect to a protected variable Z.
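
The abstract states the two competing goals but not how they are combined in a single training step. As a minimal sketch, assume the primary network has weights W, that g_P is the gradient of its prediction loss for Y with respect to W, and that g_A is the gradient of the adversary's loss for Z with respect to W. One plausible update direction removes from g_P its component along g_A, so the step does not help the adversary, and then subtracts a multiple of g_A, so the step actively hurts it. The function name `debiased_update` and the weight `alpha` are illustrative assumptions, not details taken from the text above.

```python
from typing import List


def debiased_update(grad_pred: List[float],
                    grad_adv: List[float],
                    alpha: float = 1.0) -> List[float]:
    """Combine predictor and adversary gradients into one descent direction.

    grad_pred: gradient of the predictor's loss (for Y) w.r.t. weights W.
    grad_adv:  gradient of the adversary's loss (for Z) w.r.t. the same W.
    alpha:     hypothetical weight on the adversarial term.

    Returns g_P - proj_{g_A}(g_P) - alpha * g_A. The weights would then be
    moved against this vector, as in ordinary gradient descent.
    """
    dot = sum(p * a for p, a in zip(grad_pred, grad_adv))
    norm_sq = sum(a * a for a in grad_adv)
    if norm_sq == 0.0:
        # The adversary gives no signal, so fall back to the plain gradient.
        return list(grad_pred)
    scale = dot / norm_sq  # coefficient of the projection of g_P onto g_A
    return [p - scale * a - alpha * a for p, a in zip(grad_pred, grad_adv)]


# Toy example: the first weight helps both predictor and adversary,
# the second weight helps only the predictor.
step = debiased_update([1.0, 1.0], [1.0, 0.0], alpha=0.5)
print(step)  # [-0.5, 1.0]
```

In this toy case the resulting direction has a negative inner product with g_A, so a descent step along it increases the adversary's loss to first order while still reducing the predictor's loss through the second weight.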