Interpretability Illusions in the Generalization of Simplified Models

Dan Friedman
Andrew Lampinen
Danqi Chen
ICML 2024

Abstract

A common method for studying deep learning systems is to create simplified representations of them, for example by using singular value decomposition to visualize the model's hidden states in a lower-dimensional space. This approach assumes that the simplified model is faithful to the original model. Here, we illustrate an important caveat to this assumption: even if a simplified representation can accurately approximate the original model on the training set, it may fail to match the original model's behavior out of distribution; the understanding developed from simplified representations may be an illusion. We illustrate this by training Transformer models on controlled datasets with systematic generalization splits, focusing on the Dyck balanced-parenthesis languages. We simplify these models using tools such as dimensionality reduction and clustering, and find clear patterns in the resulting representations. We then explicitly test how well these simplified proxy models match the original models' behavior on various out-of-distribution test sets. In general, the simplified proxies are less faithful out of distribution. For example, in cases where the original model generalizes to novel structures or greater depths, the simplified model may fail to generalize, or may generalize too well. We then show the generality of these results: even model simplifications that do not directly use data can be less faithful out of distribution, and other tasks can also yield generalization gaps. Our experiments raise questions about the extent to which mechanistic interpretations derived using tools like SVD can reliably predict what a model will do in novel situations.
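The sketch below is a minimal illustration (not the authors' code) of the kind of faithfulness check the abstract describes: a model's hidden states are approximated in a low-rank subspace found by SVD on training data, and the resulting simplified proxy is compared against the original model on in-distribution versus out-of-distribution inputs. All names and arrays here (`W_out`, `low_rank_proxy`, the random placeholder hidden states) are hypothetical stand-ins for a trained Transformer's components.

```python
# Hedged sketch: SVD-based simplified proxy and a faithfulness comparison.
# Random placeholders stand in for real hidden states and the output head.
import numpy as np

rng = np.random.default_rng(0)

d_model, n_classes, rank = 64, 3, 8
W_out = rng.normal(size=(d_model, n_classes))  # placeholder frozen output head

def predictions(hidden):
    """Greedy predictions of the (placeholder) model from hidden states."""
    return (hidden @ W_out).argmax(axis=-1)

def low_rank_proxy(train_hidden, rank):
    """Fit an SVD basis on training hidden states; return a projection function."""
    # Top-`rank` right singular vectors of the training hidden states.
    _, _, vt = np.linalg.svd(train_hidden, full_matrices=False)
    basis = vt[:rank]                           # (rank, d_model)
    return lambda h: (h @ basis.T) @ basis      # project onto the low-rank subspace

def agreement(hidden, project):
    """Fraction of inputs where the simplified proxy matches the original model."""
    return float(np.mean(predictions(hidden) == predictions(project(hidden))))

# Placeholder hidden states; the OOD states are shifted to mimic a
# distribution change (e.g., deeper Dyck nesting than seen in training).
train_hidden = rng.normal(size=(1000, d_model))
ood_hidden = rng.normal(size=(1000, d_model)) + rng.normal(size=(1, d_model))

project = low_rank_proxy(train_hidden, rank)
print("in-distribution agreement:   ", agreement(train_hidden, project))
print("out-of-distribution agreement:", agreement(ood_hidden, project))
```

In this setup, high agreement on the training distribution does not guarantee high agreement on the shifted inputs, which is the gap the paper refers to as an interpretability illusion.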