Fairwashing Explanations with Off-Manifold Detergent

Christopher J. Anders
Plamen Pasliev
Ann-Kathrin Dombrowski
Pan Kessel
International Conference on Machine Learning, PMLR (2020), pp. 314-323

Abstract

Explanation methods promise to make black-box classifiers more transparent. As a result, it is hoped that they can serve as evidence of a sensible, fair, and trustworthy decision-making process and thereby increase the algorithm's acceptance by end-users. In this paper, we show both theoretically and experimentally that these hopes are presently unfounded. Specifically, we show that, for any classifier g, one can always construct another classifier g̃ which has the same behavior on the data (same train, validation, and test error) but arbitrarily manipulated explanation maps. We derive this statement theoretically using differential geometry and demonstrate it experimentally for various explanation methods, architectures, and datasets. Motivated by our theoretical insights, we then propose a modification of existing explanation methods that makes them significantly more robust.