Interpreting Neural Network Judgments via Minimal, Stable, and Symbolic Corrections
Abstract
We present a new algorithm to generate minimal, stable, and symbolic corrections to
an input that will cause a neural network with ReLU activations to change its output.
We argue that such a correction is a useful way to provide feedback to a user when
the network’s output is different from a desired output. Our algorithm generates
such a correction by solving a series of linear constraint satisfaction problems. The
technique is evaluated on three neural network models: one predicting whether an
applicant will pay a mortgage, one predicting whether a first-order theorem can
be proved efficiently by a solver using certain heuristics, and the final one judging
whether a drawing is an accurate rendition of a canonical drawing of a cat.