Who’s asking? User personas and the mechanics of latent misalignment
Abstract
Why do models respond to harmful queries in some cases but not others?
Despite significant investments in improving model safety, it has been shown that misaligned capabilities remain latent in safety-tuned models. In this work, we shed light on the mechanics of this phenomenon. First, we show that even when model generations are safe, harmful content persists in hidden representations, and this content can be extracted by decoding from earlier layers. Then, we show that whether the model divulges such content depends significantly on who it is talking to, which we refer to as user persona. We study both natural language prompting and activation steering as methods for manipulating inferred user persona and show that the latter is significantly more effective at bypassing safety filters. In fact, we find it is even more effective than direct attempts to control a model's refusal tendency.
This suggests that when it comes to deciding whether to respond to harmful queries, the model is deeply biased with respect to user persona. We leverage the generative capabilities of the language model itself to investigate why certain personas break model safeguards, and discover that they enable the model to form more charitable interpretations of otherwise dangerous queries. Finally, we show that we can predict a persona's effect on refusal given only the geometry of its steering vector.
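To make the activation steering setup referenced above concrete, the sketch below shows the general recipe of adding a "persona" direction to a layer's hidden states during the forward pass. It is illustrative only and not the paper's exact implementation: the toy model, layer choice, steering strength `alpha`, and the way the steering vector is obtained are all assumptions stated in the comments.

```python
# Minimal sketch of activation steering (illustrative; not the paper's exact setup).
# A steering vector v is added to one layer's hidden states via a forward hook,
# shifting the representation toward an inferred user persona.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16

class ToyBlock(nn.Module):
    """Toy stand-in for a transformer block's residual-stream output."""
    def __init__(self, d):
        super().__init__()
        self.linear = nn.Linear(d, d)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = nn.Sequential(ToyBlock(d_model), ToyBlock(d_model), nn.Linear(d_model, d_model))

# Persona steering vector: in practice this is often taken as the difference of
# mean activations between contrastive prompt sets (assumption; one common recipe).
steering_vector = torch.randn(d_model)
alpha = 2.0  # steering strength (hyperparameter)

def add_steering(module, inputs, output):
    # Shift the block's output along the persona direction.
    return output + alpha * steering_vector

# Steer at the second block; which layer to steer is itself a hyperparameter.
handle = model[1].register_forward_hook(add_steering)

x = torch.randn(1, d_model)
steered = model(x)
handle.remove()
unsteered = model(x)
print("shift norm:", (steered - unsteered).norm().item())
```

In a real language model the same pattern applies: hook a chosen decoder layer, add a scaled persona vector to its output, and compare refusal behavior with and without the intervention.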