Interpretability Illusions in the Generalization of Simplified Models

Dan Friedman
Andrew Lampinen
Danqi Chen
ICML 2024

Abstract

A common method for studying deep learning systems is to create simplified representations of them, for example by using singular value decomposition to visualize the model's hidden states in a lower-dimensional space. This approach assumes that the simplified model is faithful to the original model. Here, we illustrate an important caveat to this assumption: even if a simplified representation can accurately approximate the original model on the training set, it may fail to match the original model's behavior out of distribution; the understanding developed from simplified representations may be an illusion. We illustrate this by training Transformer models on controlled datasets with systematic generalization splits, focusing on the Dyck balanced-parenthesis languages. We simplify these models using tools such as dimensionality reduction and clustering, and find clear patterns in the resulting representations. We then explicitly test how well these simplified proxy models match the original models' behavior on various out-of-distribution test sets. In general, the simplified proxies are less faithful out of distribution. For example, in cases where the original model generalizes to novel structures or greater depths, the simplified model may fail to generalize, or may generalize too well. We then show the generality of these results: even model simplifications that do not directly use data can be less faithful out of distribution, and other tasks can also yield generalization gaps. Our experiments raise questions about the extent to which mechanistic interpretations derived using tools like SVD can reliably predict what a model will do in novel situations.
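The sketch below is a minimal illustration (not the authors' code) of the kind of faithfulness check the abstract describes: a model's hidden states are approximated in a low-rank subspace found by SVD on training data, and the resulting simplified proxy is compared against the original model on in-distribution versus out-of-distribution inputs. All names and arrays here (`W_out`, `low_rank_proxy`, the random placeholder hidden states) are hypothetical stand-ins for a trained Transformer's components.

```python
# Hedged sketch: SVD-based simplified proxy and a faithfulness comparison.
# Random placeholders stand in for real hidden states and the output head.
import numpy as np

rng = np.random.default_rng(0)

d_model, n_classes, rank = 64, 3, 8
W_out = rng.normal(size=(d_model, n_classes))  # placeholder frozen output head

def predictions(hidden):
    """Greedy predictions of the (placeholder) model from hidden states."""
    return (hidden @ W_out).argmax(axis=-1)

def low_rank_proxy(train_hidden, rank):
    """Fit an SVD basis on training hidden states; return a projection function."""
    # Top-`rank` right singular vectors of the training hidden states.
    _, _, vt = np.linalg.svd(train_hidden, full_matrices=False)
    basis = vt[:rank]                           # (rank, d_model)
    return lambda h: (h @ basis.T) @ basis      # project onto the low-rank subspace

def agreement(hidden, project):
    """Fraction of inputs where the simplified proxy matches the original model."""
    return float(np.mean(predictions(hidden) == predictions(project(hidden))))

# Placeholder hidden states; the OOD states are shifted to mimic a
# distribution change (e.g., deeper Dyck nesting than seen in training).
train_hidden = rng.normal(size=(1000, d_model))
ood_hidden = rng.normal(size=(1000, d_model)) + rng.normal(size=(1, d_model))

project = low_rank_proxy(train_hidden, rank)
print("in-distribution agreement:   ", agreement(train_hidden, project))
print("out-of-distribution agreement:", agreement(ood_hidden, project))
```

In this setup, high agreement on the training distribution does not guarantee high agreement on the shifted inputs, which is the gap the paper refers to as an interpretability illusion.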