- Julius Adebayo
- Justin Gilmer
- Ian Goodfellow
- Been Kim
Explaining the output of a complicated machine learning model like a deep neural network (DNN) is a central challenge in machine learning. Increasingly, explanations are required for debugging models, building trust prior to model deployment, and potentially identifying unwanted effects like model bias. Several methods have been proposed to address this issue. Local explanation methods provide explanations of the output of a model on a single input. Given the importance of these explanations to the use and deployment of these models, we ask: can we trust the local explanations of DNNs that current methods produce?
In particular, we seek to assess how sensitive local explanations are to the parameter values of a DNN. We compare explanations generated from a fully trained DNN to explanations generated from the same DNN with some or all of its parameters replaced by random values. Somewhat surprisingly, we find that, for several local explanation methods, explanations derived from networks with randomized weights and with trained weights are both visually and quantitatively similar, and in some cases virtually indistinguishable. By randomizing different portions of the network, we find that local explanations rely heavily on lower-level features of the DNN.
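The comparison described above can be sketched in a few lines. The snippet below is a minimal, hypothetical NumPy illustration, not the paper's experimental setup (which uses trained image classifiers and several explanation methods): it computes a gradient-based saliency map for a small two-layer ReLU network, re-draws the weights of a chosen layer at random, recomputes the saliency, and compares the two via Pearson correlation. All names (`forward`, `gradient_saliency`, `randomize`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, params):
    """Two-layer ReLU network mapping an input vector to a scalar logit."""
    W1, b1, W2, b2 = params
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

def gradient_saliency(x, params):
    """Gradient of the logit w.r.t. the input, computed analytically.

    For logit = W2 . relu(W1 x + b1) + b2, the input gradient is
    W1^T (W2 * mask), where mask is the ReLU derivative.
    """
    W1, b1, W2, b2 = params
    mask = (W1 @ x + b1 > 0).astype(x.dtype)  # ReLU derivative
    return (W2 * mask) @ W1                   # shape: same as x

def randomize(params, layers):
    """Return a copy of params with the listed entries re-drawn at random."""
    new = list(params)
    for i in layers:
        new[i] = rng.standard_normal(np.shape(new[i]))
    return new

d, h = 16, 8
params = [rng.standard_normal((h, d)), rng.standard_normal(h),
          rng.standard_normal(h), rng.standard_normal(())]

x = rng.standard_normal(d)
s_trained = gradient_saliency(x, params)                      # original weights
s_random = gradient_saliency(x, randomize(params, [2, 3]))    # top layer randomized

# Similarity of the two saliency maps; a high value would suggest
# the explanation is insensitive to the randomized parameters.
corr = np.corrcoef(s_trained, s_random)[0, 1]
```

In the paper's setting, the "trained" network is an actual trained model rather than a random initialization, and similarity is assessed both visually and with quantitative metrics over many inputs; this sketch only shows the shape of the randomization-and-compare procedure.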