Lucas Dixon

Lucas is a principal research scientist at Google DeepMind and co-lead of PAIR (People and AI Research). He works on visualization, interpretability, and control of machine learning systems, particularly language models. His work explores how people can productively and fairly benefit from machine learning systems.

Previously, he was Chief Scientist at Jigsaw, where he founded its engineering and research teams. He has contributed scientific advances and systems in multiple disciplines, including digital security, formal logic, machine learning, and data visualization. For example, he co-founded uProxy & Outline, Project Shield, DigitalAttackMap, the Syria Defection Tracker, unfiltered.news, Conversation AI, and the Perspective API.

Before Google, Lucas completed his PhD and worked at the University of Edinburgh on the automation of mathematical reasoning and on graphical languages applied to quantum information. He also helped run a non-profit working towards more rational and informed discussion and decision-making, and was a co-founder of TheoryMine, a playful take on automating mathematical discovery. Outside of his scientific work, Lucas is also a martial arts instructor in Paris.
Authored Publications
    Why do models respond to harmful queries in some cases but not others? Despite significant investments in improving model safety, it has been shown that misaligned capabilities remain latent in safety-tuned models. In this work, we shed light on the mechanics of this phenomenon. First, we show that even when model generations are safe, harmful content persists in hidden representations and can be extracted by decoding from earlier layers. Then, we show that whether the model divulges such content depends significantly on who it is talking to, which we refer to as the user persona. We study both natural language prompting and activation steering as methods for manipulating the inferred user persona and show that the latter is significantly more effective at bypassing safety filters. In fact, we find it is even more effective than direct attempts to control a model's refusal tendency. This suggests that, when it comes to deciding whether to respond to harmful queries, the model is deeply biased with respect to user persona. We leverage the generative capabilities of the language model itself to investigate why certain personas break model safeguards, and discover that they enable the model to form more charitable interpretations of otherwise dangerous queries. Finally, we show that we can predict a persona's effect on refusal given only the geometry of its steering vector.
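The activation-steering intervention described above can be illustrated with a short sketch: compute a "persona" direction from contrasting prompts and add it to a layer's hidden states during generation. The model, layer, prompts, steering strength, and the difference-of-means recipe below are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in model for illustration
LAYER = 6             # hypothetical layer whose hidden states we steer
ALPHA = 4.0           # hypothetical steering strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def mean_hidden(prompts):
    """Average last-token hidden state at LAYER over a set of prompts."""
    states = []
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt")
            out = model(**ids, output_hidden_states=True)
            states.append(out.hidden_states[LAYER][0, -1])
    return torch.stack(states).mean(0)

# Difference-of-means "persona" direction (an assumed recipe, for illustration).
steer = mean_hidden(["I am a security researcher auditing this system."]) - \
        mean_hidden(["I am an anonymous stranger."])

def add_persona_direction(_module, _inputs, output):
    # GPT-2 blocks return a tuple; element 0 holds the hidden states.
    return (output[0] + ALPHA * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_persona_direction)
ids = tok("Tell me how lock picking works.", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**ids, max_new_tokens=40)
handle.remove()
print(tok.decode(generated[0], skip_special_tokens=True))
```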
    Parameter Efficient Reinforcement Learning from Human Feedback
    Hakim Sidahmed
    Alex Hutcheson
    Zhuonan Lin
    Zhang Chen
    Zac Yu
    Jarvis Jin
    Simral Chaudhary
    Roman Komarytsia
    Christiane Ahlheim
    Yonghao Zhu
    Bowen Li
    Jessica Hoffmann
    Hassan Mansoor
    Wei Li
    Abhinav Rastogi
    2024
    While Reinforcement Learning from Human Feedback (RLHF) effectively aligns pretrained Large Language Models (LLMs) with human preferences, its computational cost and complexity hinder wider adoption. This work introduces Parameter-Efficient Reinforcement Learning (PERL): by leveraging Low-Rank Adaptation (LoRA) (Hu et al., 2021) for reward model training and reinforcement learning, we are able to perform RL loops while updating only a fraction of the parameters required by traditional RLHF. We demonstrate that the effectiveness of this method is not confined to a specific task. We compare PERL to conventional fine-tuning (full-tuning) across X highly diverse tasks, spanning from summarization to X and X, for a total of X different benchmarks, including two novel preference datasets released with this paper. Our findings show that PERL achieves comparable performance to RLHF while significantly reducing training time (up to 2x faster for reward models and 15% faster for RL loops) and memory footprint (up to 50% reduction for reward models and 25% for RL loops). Finally, we provide a single set of parameters that achieves results on par with RLHF on every task, which shows the accessibility of the method. By mitigating the computational cost and the burden of hyperparameter search, PERL facilitates broader adoption of RLHF as an LLM alignment technique.
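As a rough illustration of the parameter-efficient setup PERL describes, the sketch below wraps a base model in LoRA adapters for both a reward model and a policy model, so that only a small fraction of parameters is trainable. It uses the Hugging Face peft library; the base model, rank, and other hyperparameters are illustrative assumptions rather than the paper's configuration.

```python
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# Reward model: a sequence classifier with a single scalar output, LoRA-adapted.
reward_base = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
reward_lora = get_peft_model(
    reward_base,
    LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16, lora_dropout=0.05),
)
reward_lora.print_trainable_parameters()  # only the adapter (and head) weights train

# Policy model for the RL loop, also LoRA-adapted.
policy_base = AutoModelForCausalLM.from_pretrained("gpt2")
policy_lora = get_peft_model(
    policy_base,
    LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05),
)
policy_lora.print_trainable_parameters()
```

With adapters like these in place, the reward-model training and RL steps update only the low-rank matrices, which is the source of the time and memory savings the abstract reports.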
    An Explorable explaining the concept of Patchscopes for an external audience. Patchscopes is an interpretability tool that allows researchers to better understand an LLM's hidden representations through natural language experiments.
    LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
    Minsuk Kahng
    Michael Xieyang Liu
    Krystal Kallarackal
    Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM (2024)
    Automatic side-by-side evaluation has emerged as a promising approach to evaluating the quality of responses from large language models (LLMs). However, analyzing the results from this evaluation approach raises scalability and interpretability challenges. In this paper, we present LLM Comparator, a novel visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. The tool supports interactive workflows for users to understand when and why a model performs better or worse than a baseline model, and how the responses from two models are qualitatively different. We iteratively designed and developed the tool by closely working with researchers and engineers at Google. This paper details the user challenges we identified, the design and development of the tool, and an observational study with participants who regularly evaluate their models.
    Inspecting hidden representations of large language models (LLMs) is of growing interest, not only to understand a model's behavior and verify its alignment with human values, but also to control it before it goes awry. Given the capabilities of LLMs in generating coherent, human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. We introduce a framework called Patchscopes and show how it can be used to answer various kinds of questions; we refer to each specific configuration, in the singular, as a Patchscope. We show that many prior inspection methods based on projecting representations into the vocabulary space, such as logit lens, tuned lens, and linear shortcuts, can be viewed as special instances of this framework. Moreover, several of their shortcomings, such as failure to inspect early layers or lack of expressivity, can be mitigated by a Patchscope. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities, such as using a more capable model to explain the representations of a smaller model. Finally, we demonstrate the utility of Patchscopes for practical applications, such as harmful belief extraction and self-correction in multi-hop reasoning.
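The core operation behind Patchscopes, taking a hidden representation from a source prompt and patching it into a different target prompt so the model can describe it in natural language, can be sketched roughly as below. The model, layers, prompts, and token positions are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

SRC_LAYER, TGT_LAYER = 8, 2   # hypothetical source and target layers

# 1) Grab the hidden state of the last token of the source prompt.
src = tok("Alexander the Great", return_tensors="pt")
with torch.no_grad():
    src_hidden = model(**src, output_hidden_states=True).hidden_states[SRC_LAYER][0, -1]

# 2) A target prompt with a placeholder token whose representation we overwrite.
target_text = "Syria: country in the Middle East. Leonardo DiCaprio: American actor. x"
tgt = tok(target_text, return_tensors="pt")
patch_pos = tgt.input_ids.shape[1] - 1   # patch the final token ("x")

def patch_hook(_module, _inputs, output):
    hidden = output[0]
    if hidden.shape[1] > patch_pos:      # only on the full-prompt forward pass
        hidden[0, patch_pos] = src_hidden
    return (hidden,) + output[1:]

handle = model.transformer.h[TGT_LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    out = model.generate(**tgt, max_new_tokens=15, do_sample=False)
handle.remove()
print(tok.decode(out[0, tgt.input_ids.shape[1]:]))
```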
    A common method to study deep learning systems is to create simplified representations, for example using singular value decomposition to visualize the model's hidden states in a lower-dimensional space. This approach assumes that the simplified model is faithful to the original model. Here, we illustrate an important caveat to this assumption: even if a simplified representation of the model can accurately approximate the original model on the training set, it may fail to match its behavior out of distribution; the understanding developed from simplified representations may be an illusion. We illustrate this by training Transformer models on controlled datasets with systematic generalization splits, focusing on the Dyck balanced-parenthesis languages. We simplify these models using tools like dimensionality reduction and clustering, and find clear patterns in the resulting representations. We then explicitly test how well these simplified proxy models match the original model's behavior on various out-of-distribution test sets. Generally, the simplified proxies are less faithful out of distribution. For example, in cases where the original model generalizes to novel structures or deeper depths, the simplified model may fail to generalize, or may generalize too well. We then show the generality of these results: even model simplifications that do not directly use data can be less faithful out of distribution, and other tasks can also yield generalization gaps. Our experiments raise questions about the extent to which mechanistic interpretations derived using tools like SVD can reliably predict what a model will do in novel situations.
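The faithfulness check described above can be sketched as follows: fit a truncated SVD to a model's hidden states on in-distribution inputs, build a simplified proxy that projects hidden states onto that low-rank subspace, and compare the proxy's predictions with the original model's both in and out of distribution. The tiny model and synthetic data below are illustrative stand-ins for the Transformer-on-Dyck setup in the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyModel(nn.Module):
    """Stand-in for the original model: an encoder producing hidden states plus a head."""
    def __init__(self, d_in=16, d_hidden=64, d_out=2):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.head = nn.Linear(d_hidden, d_out)
    def hidden(self, x):
        return torch.relu(self.encoder(x))
    def forward(self, x):
        return self.head(self.hidden(x))

model = TinyModel()

# "In-distribution" and "out-of-distribution" inputs (stand-ins for, e.g., the
# shallow vs. deeper Dyck strings used in the paper's generalization splits).
x_id = torch.randn(512, 16)
x_ood = torch.randn(512, 16) * 3.0 + 2.0

# Fit a rank-k basis to in-distribution hidden states via SVD.
with torch.no_grad():
    H = model.hidden(x_id)
mean = H.mean(0)
_, _, Vt = torch.linalg.svd(H - mean, full_matrices=False)
k = 4
basis = Vt[:k]  # top-k right singular vectors

@torch.no_grad()
def proxy(x):
    """Simplified proxy: project hidden states onto the rank-k subspace, then decode."""
    h = model.hidden(x)
    return model.head((h - mean) @ basis.T @ basis + mean)

@torch.no_grad()
def agreement(x):
    """Fraction of inputs where the proxy and the original model predict the same class."""
    return (model(x).argmax(-1) == proxy(x).argmax(-1)).float().mean().item()

print("in-distribution agreement:", agreement(x_id))
print("out-of-distribution agreement:", agreement(x_ood))
```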
    "We Need Structured Output": Towards User-centered Constraints on Large Language Model Output
    Michael Xieyang Liu
    Frederick Liu
    Alex Fiannaca
    Terry Koo
    Extended Abstracts of the ACM CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM (2024), pp. 9 (to appear)
    Large language models can produce creative and diverse responses. However, to integrate them into current developer workflows, it is essential to constrain their outputs to follow specific formats or standards. In this work, we surveyed 51 experienced industry professionals to understand the range of scenarios and motivations driving the need for output constraints from a user-centered perspective. We identified 134 concrete use cases for constraints at two levels: low-level, which ensures the output adheres to a structured format and an appropriate length, and high-level, which requires the output to follow semantic and stylistic guidelines without hallucination. Critically, applying output constraints could not only streamline the currently repetitive process of developing, testing, and integrating LLM prompts for developers, but also enhance the user experience of LLM-powered features and applications. We conclude with a discussion on user preferences and needs towards articulating intended constraints for LLMs, alongside an initial design for a constraint prototyping tool.
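As a rough illustration of the paper's "low-level" constraints (structured format and length), the sketch below validates that a model's output parses as JSON with expected fields and stays under a length cap, retrying otherwise. The call_llm stub, schema, and retry policy are hypothetical and not taken from the paper.

```python
import json

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return '{"title": "Example", "tags": ["a", "b"]}'

REQUIRED_FIELDS = {"title": str, "tags": list}   # hypothetical schema
MAX_CHARS = 500                                  # hypothetical length constraint

def constrained_generate(prompt: str, max_retries: int = 3) -> dict:
    """Retry until the model output satisfies the format and length constraints."""
    for _ in range(max_retries):
        raw = call_llm(prompt)
        if len(raw) > MAX_CHARS:
            continue
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if all(isinstance(obj.get(k), t) for k, t in REQUIRED_FIELDS.items()):
            return obj
    raise ValueError("model output never satisfied the format constraints")

print(constrained_generate("Summarize this article as JSON with title and tags."))
```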
    In 2021, researchers made a striking discovery while training a series of tiny models on toy tasks [1]. They found a set of models that suddenly flipped from memorizing their training data to correctly generalizing on unseen inputs after training for much longer. This phenomenon – where generalization seems to happen abruptly and long after fitting the training data – is called grokking and has sparked a flurry of interest [2, 3, 4, 5, 6]. Do more complex models also suddenly generalize after they’re trained longer? Large language models can certainly seem like they have a rich understanding of the world, but they might just be regurgitating memorized bits of the enormous amount of text they’ve been trained on [7, 8]. How can we tell if they’re generalizing or memorizing? In this article we’ll examine the training dynamics of a tiny model and reverse engineer the solution it finds – and in the process provide an illustration of the exciting emerging field of mechanistic interpretability [9, 10]. While it isn’t yet clear how to apply these techniques to today’s largest models, starting small makes it easier to develop intuitions as we progress towards answering these critical questions about large language models.
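The kind of toy setup in which grokking was originally observed can be sketched as below: a small network trained on modular addition with strong weight decay, logging train versus test accuracy over long training. The architecture and hyperparameters are illustrative, not the exact ones from the article or [1], and generalization may take far longer than the steps shown.

```python
import torch
import torch.nn as nn

P = 97  # modulus for the toy task: predict (a + b) mod P
torch.manual_seed(0)

pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
split = len(pairs) // 2                  # train on half the addition table
train_idx, test_idx = perm[:split], perm[split:]

embed = nn.Embedding(P, 32)
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, P))
params = list(embed.parameters()) + list(model.parameters())
# Weight decay is widely reported as important for grokking to appear.
opt = torch.optim.AdamW(params, lr=1e-3, weight_decay=1.0)

def run(idx):
    x = torch.cat([embed(pairs[idx, 0]), embed(pairs[idx, 1])], dim=-1)
    return model(x)

for step in range(20000):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(run(train_idx), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (run(train_idx).argmax(-1) == labels[train_idx]).float().mean()
            test_acc = (run(test_idx).argmax(-1) == labels[test_idx]).float().mean()
        print(f"step {step}: train_acc={train_acc:.2f} test_acc={test_acc:.2f}")
```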
    Traditional recommender systems leverage users' item preference history to recommend novel content that users may like. However, dialog interfaces that allow users to express language-based preferences offer a fundamentally different modality for preference input. Inspired by recent successes of prompting paradigms for large language models (LLMs), we study their use for making recommendations from both item-based and language-based preferences in comparison to state-of-the-art item-based collaborative filtering (CF) methods. To support this investigation, we collect a new dataset consisting of both item-based and language-based preferences elicited from users along with their ratings on a variety of (biased) recommended items and (unbiased) random items. Among numerous experimental results, we find that LLMs provide competitive recommendation performance for pure language-based preferences (no item preferences) in the near cold-start case in comparison to item-based CF methods, despite having no supervised training for this specific task (zero-shot) or only a few labels (few-shot). This is particularly promising as language-based preference representations are more explainable and scrutable than item-based or vector-based representations.
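A minimal sketch of the zero-shot prompting setup compared against collaborative filtering above: the LLM is asked to rank candidate items given only a natural-language preference description. The prompt wording and the call_llm stub are illustrative assumptions.

```python
def build_prompt(preference: str, candidates: list[str]) -> str:
    """Format a zero-shot ranking prompt from a language-based preference."""
    items = "\n".join(f"{i + 1}. {title}" for i, title in enumerate(candidates))
    return (
        "A user describes their taste as follows:\n"
        f'"{preference}"\n\n'
        "Rank the following items from most to least likely to be enjoyed, "
        "answering with the numbers only:\n"
        f"{items}\n"
    )

def call_llm(prompt: str) -> str:
    return "2, 1, 3"  # stand-in for a real model call

prompt = build_prompt(
    "I love slow-burn mysteries with unreliable narrators, nothing gory.",
    ["Blood Feast IV", "The Quiet Archivist", "Space Marines: Reloaded"],
)
print(call_llm(prompt))
```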