Tomas Pfister

Tomas Pfister is the Head of Cloud AI Research. He came to Google from Apple, where he co-founded Apple's central AI research group and published Apple's first research paper, which won the Best Paper Award at CVPR'17. Tomas' key scientific achievements include proposing a method to improve the realism of synthetic images; developing the first automated method to detect facial micro-expressions; and inventing a new way for neural networks to exploit spatiotemporal structure. He is currently exploring learning from small amounts of labeled data (using techniques such as generative models, few-shot learning, and transfer learning) and the explainability/interpretability of deep learning models, and is particularly excited about the potential of AI in healthcare & education. His research has laid the foundation for several applications such as Face ID in the iPhone X, autonomous driving, human pose estimation, detecting facial micro-expressions & translating sign language. Tomas did his PhD in deep learning with Prof Andrew Zisserman at Oxford University and his bachelor's degree in computer science at Cambridge University. He is a recipient of the Forbes 30 Under 30 award and has received over 40 research awards, including 3 best paper awards, with numerous publications in top AI research venues. His work has been frequently featured in mainstream media, including Forbes, Business Insider & Wired.
Authored Publications
Artificial intelligence is rapidly evolving, marked by the emergence of Large Language Model (LLM) agents – systems capable of complex reasoning, planning, and interaction with digital and physical environments. These agents, powered by advancements in LLMs, demonstrate remarkable capabilities across diverse domains, including finance, healthcare, web navigation, software development, and daily task assistance. Unlike traditional AI systems, LLM agents can perceive their surroundings, formulate multi-step plans, utilize external tools and APIs, access memory or knowledge bases, and execute actions to achieve specified goals. This ability to act upon the world, however, introduces significant safety and security challenges. The safety paradigms developed for traditional LLMs, primarily focused on mitigating harmful textual outputs (e.g., toxicity, bias), are insufficient for safeguarding LLM agents. Agents interacting with dynamic environments and executing actions present a broader attack surface and new categories of risk. These include performing unsafe operations, violating privacy constraints through improper data handling or access control failures, deviating from user objectives (task misalignment), and susceptibility to novel manipulation techniques like indirect prompt injection and memory poisoning. Ensuring the trustworthy operation of these powerful agents is paramount, especially as they are integrated into high-stakes applications. To address this critical challenge, we introduce VeriGuard, a novel framework designed to enhance the safety and reliability of LLM agents by interactively verifying their policies and actions. VeriGuard integrates a verification module that intercepts code-based actions proposed by the agent. In the first step, VeriGuard generates and verifies policies, which are rigorously checked against a set of predefined safety and security specifications. Each action is then verified to ensure it aligns with the agent's specification. This interactive verification loop ensures that the agent's behavior remains within safe operational bounds, effectively preventing the execution of harmful or unintended operations. By verifying each step, VeriGuard provides a robust safeguard, substantially improving the trustworthiness of LLM agents in complex, real-world environments.
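The abstract includes no code; as a rough illustration only, the following Python sketch shows what an interactive action-verification loop of this kind might look like. All names here (Policy, verify_policies, run_agent) are invented for the example and are not VeriGuard's API.

```python
# Illustrative sketch of an interactive action-verification loop in the spirit of
# VeriGuard. All names (Policy, verify_policies, run_agent) are placeholders,
# not the paper's actual framework.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Policy:
    """A verified policy: a predicate every proposed action must satisfy."""
    name: str
    check: Callable[[str], bool]

def verify_policies(policies: list[Policy], specs: list[str]) -> list[Policy]:
    # Placeholder: in the framework, policies are generated and then rigorously
    # checked against predefined safety/security specifications.
    return [p for p in policies if p.name in specs]

def run_agent(actions: list[str], policies: list[Policy]) -> list[str]:
    executed = []
    for action in actions:
        # Intercept each code-based action and verify it before execution.
        if all(p.check(action) for p in policies):
            executed.append(action)          # safe: execute
        else:
            print(f"blocked unsafe action: {action!r}")
    return executed

if __name__ == "__main__":
    policies = verify_policies(
        [Policy("no_deletes", lambda a: "rm -rf" not in a)],
        specs=["no_deletes"],
    )
    print(run_agent(["ls /tmp", "rm -rf /"], policies))
```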
Integrating tools like Code Interpreter and Search has significantly improved the reasoning of Large Language Models (LLMs), as shown by leading models such as OpenAI's ChatGPT Agent, Google's Gemini-Pro, and xAI's Grok 4. However, the research community still lacks practical guidance on fully leveraging these tools. The main challenge lies in finding an effective method to fully exploit the benefits of textual reasoning, coding, and searching across different questions. To address this, we propose an ensemble-based framework that runs multiple agents in parallel, each exploring different answer paths with distinct tool-use strategies. Agents iteratively share and refine their answers by considering the original question and previous responses. Our proposed method, Tool-Use Mixture (TUMIX), achieves significant gains over other representative tool-augmented test-time scaling methods such as Self-MoA, Symbolic-MoE, DEI, SciMaster, and GSA. With near-equal inference costs, TUMIX delivers an average +3.55% accuracy improvement over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks (HLE, GPQA, AIME 24&25), where coding and search can effectively support reasoning when applied properly. We find that agent diversity and quality are crucial, and can be further improved by querying LLMs to automatically optimize agent designs. To reduce costs, TUMIX halts refinement once sufficient confidence is reached, preserving nearly the same performance at just 49% of the inference cost. With further scaling, TUMIX can achieve even higher performance, though at substantially greater cost.
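As a hedged illustration of the ensemble-and-refine idea described above (not the paper's implementation), the sketch below runs several toy agents, lets them see each other's previous answers, and stops refining once agreement is high; the agent functions and the majority-vote confidence rule are placeholders.

```python
# Illustrative sketch of an ensemble-and-refine loop with early stopping,
# loosely following the TUMIX description. Agents and the agreement-based
# confidence rule are toy placeholders.
from collections import Counter
from typing import Callable

Agent = Callable[[str, list[str]], str]  # (question, previous answers) -> answer

def tumix_style(question: str, agents: list[Agent], max_rounds: int = 3,
                confidence: float = 0.8) -> str:
    answers: list[str] = []
    for _ in range(max_rounds):
        # Each agent answers, conditioned on the question and prior answers.
        answers = [agent(question, answers) for agent in agents]
        top, count = Counter(answers).most_common(1)[0]
        # Halt refinement once agreement (a proxy for confidence) is high,
        # saving inference cost while preserving accuracy.
        if count / len(answers) >= confidence:
            return top
    return Counter(answers).most_common(1)[0][0]

# Toy agents with distinct "tool-use strategies" (placeholders).
agents = [
    lambda q, prev: "42",                      # text-reasoning agent
    lambda q, prev: "42",                      # code-interpreter agent
    lambda q, prev: prev[0] if prev else "7",  # search agent that defers later
]
print(tumix_style("What is 6 * 7?", agents))
```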
We propose a principled method to synthesize high-quality multi-turn function-calling trajectories to align large language model (LLM)-based agents. We start by iteratively building a function-calling graph and defining node operations to increase its complexity, which enables us to construct a reliable reference. Then, based on the synthesized function-calling graph, we adopt back-and-forth translation to first construct multi-turn user queries and then fill in the function arguments with information from the query. We sample positive trajectories that distill the function-graph reference and negative trajectories that contrast with the positive trajectories in targeted loss patterns in multi-turn scenarios. Training on the positive trajectories with supervised fine-tuning and applying preference optimization against the negative trajectories, we obtain 67.42 on BFCL and 71.7 on ToolQuery with an open-source 14B-parameter model, surpassing the performance of strong proprietary models like o1.
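A minimal sketch of the trajectory-synthesis recipe outlined above, under the assumption that the function-calling graph can be represented as a simple adjacency dict; the node operation and the positive/negative sampling here are toy stand-ins, not the paper's pipeline.

```python
# Illustrative sketch: grow a function-call graph, then derive a positive
# trajectory (follows the graph) and a contrastive negative (targeted loss
# pattern). The graph schema and operations are simplified placeholders.
import random

# Edges mean "output of A feeds an argument of B".
graph = {"search_flights": ["book_flight"], "book_flight": ["send_confirmation"]}

def add_node(graph: dict, new_call: str, after: str) -> None:
    # Node operation that increases graph complexity.
    graph.setdefault(after, []).append(new_call)
    graph.setdefault(new_call, [])

def positive_trajectory(graph: dict, start: str) -> list[str]:
    # A positive trajectory distills the graph reference.
    path, node = [start], start
    while graph.get(node):
        node = graph[node][0]
        path.append(node)
    return path

def negative_trajectory(positive: list[str]) -> list[str]:
    # A contrastive negative: drop one required call.
    drop = random.randrange(len(positive))
    return positive[:drop] + positive[drop + 1:]

add_node(graph, "apply_discount", after="search_flights")
pos = positive_trajectory(graph, "search_flights")
print("positive:", pos)
print("negative:", negative_trajectory(pos))
```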
Computer use agents (CUAs) need to plan long-horizon task workflows grounded in diverse, ever-changing applications and environments, but learning is hindered by the scarcity of large-scale, high-quality training data. Existing datasets are small, domain-specific, and costly to annotate, while current synthetic data generation methods often yield brittle, simplistic, or misaligned task demonstrations. We introduce Watch & Learn (W&L), a framework that transforms human demonstration videos available on the Internet into executable UI trajectories at scale. Inspired by robotics, we train an inverse dynamics model that accurately predicts user actions from consecutive screens, bypassing the need for complex heuristics. To scale to the web, we curate a large state-transition corpus and design a retrieval framework that identifies relevant video tutorials, enabling automatic conversion of raw videos into structured UI trajectories without requiring manual annotations. Beyond training data, we show that the generated UI trajectories can also serve as in-context exemplars, providing CUAs with long-horizon priors and domain-specific knowledge at inference time. On the challenging OSWorld and Mind2Web benchmarks, UI trajectories extracted with W&L consistently improve both general-purpose and state-of-the-art frameworks when used in-context, and deliver stronger gains for open-source models when used in training. These results highlight web-scale human demonstration videos as a practical and scalable foundation for advancing CUAs towards real-world deployment.
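To make the inverse-dynamics idea concrete, here is an illustrative sketch that predicts an action from two consecutive screens and stitches the predictions into a UI trajectory; the pixel-diff "model" is only a runnable placeholder for the trained predictor described in the abstract.

```python
# Illustrative sketch of the inverse-dynamics interface behind Watch & Learn:
# predict the user action that explains a screen transition, then chain
# predictions into a trajectory. The diff heuristic stands in for the model.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str                 # e.g. "click", "type"
    target: tuple[int, int]   # (x, y) location

def inverse_dynamics(prev_screen: list[list[int]],
                     next_screen: list[list[int]]) -> Action:
    # Placeholder: locate the first changed pixel and call it a click.
    for y, (row0, row1) in enumerate(zip(prev_screen, next_screen)):
        for x, (p0, p1) in enumerate(zip(row0, row1)):
            if p0 != p1:
                return Action("click", (x, y))
    return Action("noop", (-1, -1))

def video_to_trajectory(frames: list[list[list[int]]]) -> list[Action]:
    # Consecutive-frame predictions form an executable UI trajectory.
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]

frames = [[[0, 0], [0, 0]], [[0, 1], [0, 0]], [[0, 1], [1, 0]]]
print(video_to_trajectory(frames))
```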
While Large Language Models (LLMs) have shown remarkable advancements in reasoning and tool use, they often fail to generate optimal, grounded solutions under complex constraints. Real-world travel planning exemplifies these challenges, evaluating agents' abilities to handle constraints that are explicit, implicit, and even evolving based on interactions with dynamic environments and user needs. In this paper, we present ATLAS, a general multi-agent framework designed to handle the complex, constraint-aware nature of real-world travel planning tasks. Our framework introduces a principled approach to the fundamental challenges of constraint-aware planning through dedicated mechanisms for dynamic constraint management, iterative plan critique, and adaptive interleaved search. ATLAS demonstrates state-of-the-art performance on the TravelPlanner benchmark, improving the final pass rate from 17.8% to 44.4% over its best alternative. More importantly, this is the first work to be evaluated on, and to demonstrate quantitative effectiveness in, real-world travel planning with live information search and multi-turn feedback. In this realistic setting, ATLAS demonstrates its ability to adapt to multi-turn user feedback, achieving an 84% final pass rate, which significantly outperforms baselines including ReAct (59%) and a monolithic agent (27%).
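A toy sketch (not the ATLAS implementation) of a constraint-aware plan/critique loop with dynamic constraint updates from user feedback; the planner and critic functions are placeholders for the framework's dedicated agents and interleaved live search.

```python
# Illustrative sketch: dynamic constraint management plus iterative plan
# critique, in the spirit of the mechanisms named in the abstract.
def plan(constraints: set[str]) -> dict:
    # Placeholder planner: hotel choice depends on a "cheap" constraint.
    return {"hotel": "hostel" if "cheap" in constraints else "resort"}

def critique(p: dict, constraints: set[str]) -> list[str]:
    # Placeholder critic: report violated constraints.
    return ["cheap"] if "cheap" in constraints and p["hotel"] == "resort" else []

def planning_loop(initial: set[str], feedback_turns: list[str], max_iters: int = 5):
    constraints = set(initial)
    for turn in [None, *feedback_turns]:
        if turn:
            constraints.add(turn)                    # dynamic constraint management
        for _ in range(max_iters):
            p = plan(constraints)
            if not critique(p, constraints):         # iterative plan critique
                break
        yield p

for p in planning_loop({"3 days"}, feedback_turns=["cheap"]):
    print(p)
```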
Automating data visualization from natural language is crucial for data science, yet current systems struggle with complex, multi-file datasets and iterative refinement. Existing approaches, including simple single- or multi-agent systems, often oversimplify the task, focusing on initial query parsing while failing to robustly manage data complexity, code errors, or final visualization quality. In this paper, we reframe this challenge as a collaborative multi-agent problem. We introduce CoDA, a multi-agent system that employs specialized LLM agents for metadata analysis, task planning, code generation, and iterative reflection. We formalize this pipeline, demonstrating how metadata-focused analysis bypasses token limits and quality-driven refinement ensures robustness. Extensive evaluations show CoDA achieves substantial accuracy gains, outperforming competitive baselines by up to 49.0%. This work advocates that future visualization automation should evolve from isolated code generation to integrated, collaborative agentic workflows.
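As an illustration only, the sketch below wires up a metadata-analysis, planning, code-generation, and reflection pipeline; each "agent" is a placeholder function standing in for a specialized LLM agent, and none of the names come from the paper.

```python
# Illustrative sketch of a CoDA-style multi-agent visualization pipeline.
# All agents are trivial placeholders for specialized LLM agents.
def metadata_agent(files: dict[str, list[str]]) -> dict:
    # Summarize columns instead of passing raw data, which keeps prompts
    # within token limits.
    return {name: {"columns": cols} for name, cols in files.items()}

def planning_agent(metadata: dict, query: str) -> str:
    return f"plot {query} using columns {list(metadata.values())[0]['columns']}"

def codegen_agent(plan: str) -> str:
    return f"# generated script\nprint({plan!r})"

def reflection_agent(code: str) -> bool:
    # Placeholder quality check; the real agent critiques the rendered chart.
    return "plot" in code

def coda_style(files: dict[str, list[str]], query: str, max_rounds: int = 3) -> str:
    meta = metadata_agent(files)
    code = codegen_agent(planning_agent(meta, query))
    for _ in range(max_rounds):
        if reflection_agent(code):
            break
        code = codegen_agent(planning_agent(meta, query))   # quality-driven refinement
    return code

print(coda_style({"sales.csv": ["month", "revenue"]}, "monthly revenue"))
```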
The proliferation of Large Language Models (LLMs) has opened new opportunities in data science, yet their practical deployment is often constrained by the challenge of discovering relevant data within large and heterogeneous data lakes. Existing approaches, including single-agent and master–slave multi-agent systems, struggle with scalability, information heterogeneity, and robustness to irrelevant files. To address these limitations, we propose a novel multi-agent communication paradigm inspired by the blackboard architecture in traditional AI and software design. In this framework, a central agent posts information requests to a shared blackboard, and autonomous subordinate agents, each responsible for a partition of the data lake, volunteer to respond based on their capabilities. This distributed design improves scalability and flexibility by eliminating the need for a central coordinator to have prior knowledge of agent expertise. We evaluate the approach on three benchmarks that require explicit data discovery: KramaBench and versions of DS-Bench and DA-Code modified to incorporate data discovery. Experimental results demonstrate that the blackboard architecture substantially outperforms baselines, including RAG and the master–slave paradigm, achieving a 13% to 57% relative improvement in end-to-end task success and up to a 9% relative gain in F1 score for data discovery across both proprietary and open-source LLMs. These findings establish the blackboard paradigm as a scalable and generalizable communication framework for multi-agent data science systems.
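A small sketch of the blackboard communication pattern described above, with invented names: a central agent posts a request to a shared board, and partition agents volunteer responses only when their slice of the data lake looks relevant.

```python
# Illustrative sketch of blackboard-style coordination for data discovery.
# Blackboard, PartitionAgent, and the relevance rule are placeholders.
from dataclasses import dataclass, field

@dataclass
class Blackboard:
    requests: list[str] = field(default_factory=list)
    responses: dict[str, list[str]] = field(default_factory=dict)

@dataclass
class PartitionAgent:
    name: str
    files: list[str]

    def maybe_volunteer(self, board: Blackboard) -> None:
        for request in board.requests:
            # Volunteer only when this partition looks relevant; no central
            # coordinator needs prior knowledge of agent expertise.
            hits = [f for f in self.files if request in f]
            if hits:
                board.responses.setdefault(request, []).extend(hits)

board = Blackboard()
board.requests.append("sales")                  # central agent posts a request
agents = [PartitionAgent("a", ["sales_2023.csv"]),
          PartitionAgent("b", ["hr_records.csv"])]
for agent in agents:
    agent.maybe_volunteer(board)
print(board.responses)                          # {'sales': ['sales_2023.csv']}
```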
Test-time scaling has shown considerable success in improving the performance of language models on complex reasoning tasks without requiring fine-tuning. However, current strategies, such as self-reflection or ensembling, primarily focus on logical or structural refinement. They do not leverage the guiding potential of affective feedback. Inspired by psychological research showing that emotions can modulate cognitive performance, we introduce HEART, a novel framework that uses emotionally driven prompts for iterative self-correction. HEART provides feedback on a model's incorrect response using a curated set of concise, emotionally charged phrases based on Paul Ekman's six basic emotions. By systematically varying the emotional tone of the feedback across iterations, our method guides the model to escape flawed reasoning paths and explore more promising alternatives. We evaluate our framework on challenging reasoning benchmarks including OlympiadBench, Humanity's Last Exam, and SimpleQA. Across these benchmarks, our approach delivers significantly deeper reasoning, which leads to consistent and significant increases in accuracy compared to existing prompting methods. Crucially, these gains are observed across a diverse range of model architectures, demonstrating the broad applicability of our technique. Overall, our findings suggest that the next frontier in machine reasoning may lie not just in refining logic, but also in understanding and leveraging the 'HEART' of the models.
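For illustration, here is a sketch of an emotionally varied self-correction loop in the spirit of HEART; the model, correctness checker, and feedback phrases are placeholders rather than the paper's curated set.

```python
# Illustrative sketch: when the model's answer is wrong, reprompt with feedback
# whose emotional tone changes on each iteration. Phrases are invented examples.
EMOTION_FEEDBACK = {           # Paul Ekman's six basic emotions
    "anger": "This is wrong. Try again.",
    "disgust": "That reasoning is sloppy. Redo it.",
    "fear": "A mistake here is costly. Be careful.",
    "happiness": "You are close! One more push.",
    "sadness": "It is disappointing to see this error.",
    "surprise": "Unexpected! Re-examine your steps.",
}

def heart_style(question: str, model, is_correct) -> str:
    answer = model(question)
    for emotion, phrase in EMOTION_FEEDBACK.items():
        if is_correct(answer):
            break
        # Vary the emotional tone of the feedback across iterations.
        answer = model(f"{question}\nFeedback ({emotion}): {phrase}\n"
                       f"Previous answer: {answer}")
    return answer

# Toy model that only answers correctly after it has seen some feedback.
state = {"tries": 0}
def toy_model(prompt: str) -> str:
    state["tries"] += 1
    return "4" if state["tries"] > 2 else "5"

print(heart_style("What is 2 + 2?", toy_model, lambda a: a == "4"))
```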
Recent advances in knowledge distillation (KD) have enabled smaller student models to approach the performance of larger teacher models. However, popular methods such as supervised KD and on-policy KD are adversely impacted by knowledge gaps between teacher and student in practical scenarios. Supervised KD suffers from a distribution mismatch between training on a static dataset and inference over final student-generated outputs. Conversely, on-policy KD, which uses student-generated samples for training, can suffer from low-quality training examples with which teacher models are not familiar, resulting in inaccurate teacher feedback. To address these limitations, we introduce Speculative Knowledge Distillation (SKD), a novel approach that leverages cooperation between student and teacher models to generate high-quality training data on-the-fly while aligning with the student's inference-time distribution. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution, transferring high-quality knowledge adaptively. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following, and show that SKD consistently outperforms existing KD methods across different domains, data sizes, and model initialization strategies.
Recent knowledge distillation (KD) research has made significant progress on improving smaller student models to match larger teachers' performance. Two notable methods, supervised KD and on-policy KD, have emerged as state-of-the-art approaches. However, supervised KD for auto-regressive models suffers from a distribution mismatch between training over a fixed dataset and inference over student-generated outputs. Conversely, on-policy KD, which uses student-generated samples for training, can suffer from low-quality training examples and the teacher's potential inaccuracies in assessing these samples. To address these limitations, we introduce Speculative Knowledge Distillation (SKD). Instead of solely training on teacher- or student-proposed samples, SKD leverages the student model to initially propose tokens following its own generation distribution. Subsequently, the teacher model is employed to replace tokens that are deemed out-of-distribution. Compared with supervised KD, 1) the samples generated by SKD are more likely to align with the student's inference-time distribution, and 2) SKD can mitigate the generation of low-quality sequences by incorporating the teacher's feedback at each token. Furthermore, we demonstrate that SKD is a generic framework capable of implementing both supervised and on-policy knowledge distillation as specific instances. To validate SKD's effectiveness, we apply it to distill autoregressive large language models for various tasks, including translation, summarization, math, and instruction following. Our experiments consistently demonstrate SKD's superior performance compared to existing methods across different domains, tasks, data sizes, and model initialization strategies.
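Since both SKD abstracts describe the same token-level mechanism, here is one hedged sketch of it: the student proposes each token and the teacher replaces proposals that rank poorly under its own distribution. The toy "distributions" are plain dictionaries, not real language-model logits, and skd_sample is an invented name.

```python
# Illustrative sketch of the speculative KD sampling step: student proposes a
# token; if it ranks poorly under the teacher's distribution, the teacher
# replaces it with its own sample.
import random

def skd_sample(prompt: list[str], student_dist, teacher_dist,
               length: int = 5, top_k: int = 2) -> list[str]:
    tokens = list(prompt)
    for _ in range(length):
        proposal = student_dist(tokens)                 # student proposes a token
        teacher_probs = teacher_dist(tokens)            # teacher scores the context
        top = sorted(teacher_probs, key=teacher_probs.get, reverse=True)[:top_k]
        if proposal not in top:
            # Poorly ranked under the teacher: replace with a teacher sample.
            proposal = random.choices(list(teacher_probs),
                                      weights=list(teacher_probs.values()))[0]
        tokens.append(proposal)
    return tokens

student = lambda ctx: "the"                              # weak student: always "the"
teacher = lambda ctx: {"cat": 0.6, "sat": 0.3, "the": 0.1}
print(skd_sample(["a"], student, teacher))
```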