Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Abstract
Inspecting the hidden representations of large language models (LLMs) is of growing interest, not only to understand a model's behavior and verify its alignment with human values, but also to control the model before it goes awry. Given the capabilities of LLMs in generating coherent, human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. We introduce a framework called Patchscopes and show how it can be used to answer various kinds of questions about an LLM's computation; we refer to a specific configuration of the framework, in the singular, as a Patchscope. We show that many prior inspection methods based on projecting representations into the vocabulary space, such as the logit lens, the tuned lens, and linear shortcuts, can be viewed as special instances of this framework. Moreover, several of their shortcomings, such as failure to inspect early layers or lack of expressivity, can be mitigated by a Patchscope. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities, such as using a more capable model to explain the representations of a smaller model. Finally, we demonstrate the utility of Patchscopes in practical applications such as harmful belief extraction and self-correction in multi-hop reasoning.