Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

1 - 15 of 10793 publications
    Productionizing Quantum Mass Production
    Bill Huggins
    Nathan Wiebe
    arXiv for now (2026) (to appear)
    For many practical applications of quantum computing, the slowest and most costly steps involve coherently accessing classical data. We help address this challenge by applying mass production techniques, which can sometimes allow us to perform operations many times in parallel for a cost that is comparable to a single execution [1-3]. We combine existing mass-production results with modern approaches for loading classical data using "quantum read-only memory." We show that quantum mass production techniques offer no benefit when we consider a cost model that focuses purely on the number of non-Clifford gates. However, analyzing the constant factors in a more nuanced cost model, we find that it may be possible to obtain a reduction in cost of an order of magnitude or more for a variety of reasonably sized fault-tolerant quantum algorithms. We present several applications of quantum mass-production techniques beyond naive parallelization, including a strategy for reducing the cost of serial calls to the same data loading step.
    FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
    Diganta Misra
    Yanqi Luo
    Anjali Sridhar
    Justine Gehring
    Silvio Soares Ribeiro Junior
    2026
    AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative, but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization.
    Quartic Quantum Speedups for Planted Inference Problems
    Alexander Schmidhuber
    Ryan O'Donnell
    Physical Review X, 15 (2025), pp. 021077
    We describe a quantum algorithm for the Planted Noisy kXOR problem (also known as sparse Learning Parity with Noise) that achieves a nearly quartic (4th power) speedup over the best known classical algorithm while also only using logarithmically many qubits. Our work generalizes and simplifies prior work of Hastings, by building on his quantum algorithm for the Tensor Principal Component Analysis (PCA) problem. We achieve our quantum speedup using a general framework based on the Kikuchi Method (recovering the quartic speedup for Tensor PCA), and we anticipate it will yield similar speedups for further planted inference problems. These speedups rely on the fact that planted inference problems naturally instantiate the Guided Sparse Hamiltonian problem. Since the Planted Noisy kXOR problem has been used as a component of certain cryptographic constructions, our work suggests that some of these are susceptible to super-quadratic quantum attacks.
    This short paper describes a new circuit to measure surface codes, which allows them to be implemented on the heavy-square lattice. The circuits perform far worse than the usual surface code, but are more efficient in terms of the distance they can achieve for a given number of qubits and couplers. Paper Abstract: We present and benchmark an interesting subfamily of circuits within the LUCI framework, which we refer to as diamond circuits, that implement a surface code on a Lieb or “Heavy-Square” lattice. This makes them more qubit- and measurement-efficient than previous constructions. These circuits are built around a mid-cycle state that resembles a Bravyi-Bacon-Shor surface code on the data and measurement qubits. These circuits preserve the spacelike distance of the code, but suffer a penalty in timelike distance. This could be useful in regimes where quantum computers are limited by the number of control lines or frequency collisions.
    Virtual Reality headsets isolate users from the real world by restricting their perception to the virtual world. Video See-Through (VST) headsets address this by utilizing world-facing cameras to create Augmented Reality experiences. However, directly displaying camera feeds can cause visual discomfort and cybersickness due to the inaccurate perception of scale and exaggerated motion parallax. This paper presents initial findings on the potential of geometry-aware passthrough systems to mitigate cybersickness through enhanced depth perception. We introduce a promising protocol for quantitatively measuring cybersickness experienced by users in VST headsets. Using this protocol, we conduct a user study to compare direct passthrough and geometry-aware passthrough systems. To the best of our knowledge, our study is the first to reveal reduced nausea, disorientation, and total scores of cybersickness with geometry-aware passthrough. It also uncovers several potential avenues to further mitigate visually-induced discomfort.
    JuMP and MathOptInterface.jl give access to many solvers, both very common in the industry and more specialised. Google offers its own in-house solvers as part of the open-source package OR-Tools: Glop, a simplex solver; CP-SAT, an award-winning constraint-programming solver; PDLP, a first-order solver for large-scale linear programming. ORTools.jl is a recent package that gives access to these solvers through MathOptInterface.jl. It supports both local and remote use, meaning that users do not need a local installation to solve linear and integer problems thanks to Google Cloud. More recently, ORTools.jl started offering a native interface for constraint programming building upon the work in MathOptInterface.jl. However, OR-Tools has more than this to offer, including a scalable routing solver for large-scale VRPs and a state-of-the-art set-cover solver. MathOptInterface.jl does not yet offer an interface for these problems and we would like to gauge the community’s interest in these specific solvers.
    Deep residual architectures, such as ResNet and the Transformer, have enabled models of unprecedented depth, yet a formal understanding of why depth is so effective remains an open question. A popular intuition, following Veit et al. (2016), is that these residual networks behave like ensembles of many shallower models. Our key finding is an explicit analytical formula that verifies this ensemble perspective, proving that increasing network depth is mathematically equivalent to expanding the size of this implicit ensemble. Furthermore, our expansion reveals a hierarchical ensemble structure in which the combinatorial growth of computation paths leads to an explosion in the output signal, explaining the historical necessity of normalization layers in training deep models. This insight offers a first-principles explanation for the historical dependence on normalization layers and sheds new light on a family of successful normalization-free techniques like SkipInit and Fixup. However, while these previous approaches infer scaling factors through optimizer analysis or a heuristic analogy to Batch Normalization, our work offers the first explanation derived directly from the network’s inherent functional structure. Specifically, our Residual Expansion Theorem reveals that scaling each residual module provides a principled solution to taming the combinatorial explosion inherent to these architectures. We further show that this scaling acts as a capacity control that also implicitly regularizes the model’s complexity.
    Computer use agents (CUAs) need to plan long-horizon task workflows grounded in diverse, ever-changing applications and environments, but learning is hindered by the scarcity of large-scale, high-quality training data. Existing datasets are small, domain-specific, and costly to annotate, while current synthetic data generation methods often yield brittle, simplistic, or misaligned task demonstrations. We introduce Watch & Learn (W&L), a framework that transforms human demonstration videos available on the Internet into executable UI trajectories at scale. Inspired by robotics, we train an inverse dynamics model that accurately predicts user actions from consecutive screens, bypassing the need for complex heuristics. To scale to the web, we curate a large state-transition corpus and design a retrieval framework that identifies relevant video tutorials, enabling automatic conversion of raw videos into structured UI trajectories without requiring manual annotations. Beyond training data, we show that the generated UI trajectories can also serve as in-context exemplars, providing CUAs with long-horizon priors and domain-specific knowledge at inference time. On the challenging OSWorld and Mind2Web benchmarks, UI trajectories extracted with W&L consistently improve both general-purpose and state-of-the-art frameworks when used in-context, and deliver stronger gains for open-source models when used in training. These results highlight web-scale human demonstration videos as a practical and scalable foundation for advancing CUAs towards real-world deployment.
    Fast Tensor Completion via Approximate Richardson Iteration
    Mehrdad Ghadiri
    Yunbum Kook
    Ali Jadbabaie
    Proceedings of the 42nd International Conference on Machine Learning (2025)
    We study tensor completion (TC) through the lens of low-rank tensor decomposition (TD). Many TD algorithms use fast alternating minimization methods, which solve highly structured linear regression problems at each step (e.g., for CP, Tucker, and tensor-train decompositions). However, such algebraic structure is lost in TC regression problems, making direct extensions unclear. To address this, we propose a lifting approach that approximately solves TC regression problems using structured TD regression algorithms as blackbox subroutines, enabling sublinear-time methods. We theoretically analyze the convergence rate of our approximate Richardson iteration based algorithm, and we demonstrate on real-world tensors that its running time can be 100x faster than direct methods for CP completion.
    Test-time scaling has shown considerable success in improving the performance of language models on complex reasoning tasks without requiring fine-tuning. However, current strategies, such as self-reflection or ensembling, primarily focus on logical or structural refinement. They do not leverage the guiding potential of affective feedback. Inspired by psychological research showing that emotions can modulate cognitive performance, we introduce HEART, a novel framework that uses emotionally-driven prompts for iterative self-correction. HEART provides feedback on a model's incorrect response using a curated set of concise, emotionally charged phrases based on Paul Ekman's six basic emotions. By systematically varying the emotional tone of the feedback across iterations, our method guides the model to escape flawed reasoning paths and explore more promising alternatives. We evaluate our framework on challenging reasoning benchmarks including OlympiadBench, Humanity's Last Exam, and SimpleQA. Across these benchmarks, our approach delivers significantly deeper reasoning, which leads to a consistent and significant increase in accuracy compared to existing prompting methods. Crucially, these gains are observed across a diverse range of model architectures, demonstrating the broad applicability of our technique. Overall, our findings suggest that the next frontier in machine reasoning may lie not just in refining logic, but also in understanding and leveraging the 'HEART' of the models.
    The Anatomy of a Personal Health Agent
    Ahmed Metwally
    Ken Gu
    Jiening Zhan
    Kumar Ayush
    Hong Yu
    Amy Lee
    Qian He
    Zhihan Zhang
    Isaac Galatzer-Levy
    Xavi Prieto
    Andrew Barakat
    Ben Graef
    Yuzhe Yang
    Daniel McDuff
    Brent Winslow
    Shwetak Patel
    Girish Narayanswamy
    Conor Heneghan
    Max Xu
    Jacqueline Shreibati
    Mark Malhotra
    Orson Xu
    Tim Althoff
    Tony Faranesh
    Nova Hammerquist
    Vidya Srinivas
    arXiv (2025)
    Health is a fundamental pillar of human wellness, and the rapid advancements in large language models (LLMs) have driven the development of a new generation of health agents. However, the solution to fulfill diverse needs from individuals in daily non-clinical settings is underexplored. In this work, we aim to build a comprehensive personal health assistant that is able to reason about multimodal data from everyday consumer devices and personal health records. To understand end users’ needs when interacting with such an assistant, we conducted an in-depth analysis of query data from users, alongside qualitative insights from users and experts gathered through a user-centered design process. Based on these findings, we identified three major categories of consumer health needs, each of which is supported by a specialist subagent: (1) a data science agent that analyzes both personal and population-level time-series wearable and health record data to provide numerical health insights, (2) a health domain expert agent that integrates users’ health and contextual data to generate accurate, personalized insights based on medical and contextual user knowledge, and (3) a health coach agent that synthesizes data insights, drives multi-turn user interactions and interactive goal setting, guiding users using a specified psychological strategy and tracking users’ progress. Furthermore, we propose and develop a multi-agent framework, Personal Health Insight Agent Team (PHIAT), that enables dynamic, personalized interactions to address individual health needs. To evaluate these individual agents and the multi-agent system, we develop a set of N benchmark tasks and conduct both automated and human evaluations, involving hundreds of hours of evaluation from health experts and hundreds of hours of evaluation from end users. Our work establishes a strong foundation towards the vision of a personal health assistant accessible to everyone in the future and represents the most comprehensive evaluation of a consumer AI health agent to date.
    Recently, decomposing complex problems into simple subtasks, a crucial part of human-like natural planning, has significantly boosted the performance of large language models (LLMs). However, leveraging such planning structures during post-training to boost the performance of smaller open-source LLMs remains underexplored. Motivated by this, we introduce Plan-Tuning, a unified post-training framework that (i) distills synthetic task decompositions (termed “planning trajectories”) from large-scale LLMs and (ii) fine-tunes smaller models via supervised and reinforcement-learning objectives designed to mimic these planning processes to improve complex reasoning. On GSM8k and the MATH benchmarks, plan-tuned models outperform strong baselines by an average of ~7%. Furthermore, plan-tuned models show better generalization capabilities on out-of-domain datasets, with average performance improvements of ~10% and ~12% on OlympiadBench and AIME 2024, respectively. Our detailed analysis demonstrates how planning trajectories improve complex reasoning capabilities, showing that Plan-Tuning is an effective strategy for improving task-specific performance of smaller LLMs.
    A Scalable Framework for Evaluating Health Language Models
    Neil Mallinar
    Tony Faranesh
    Brent Winslow
    Nova Hammerquist
    Ben Graef
    Cathy Speed
    Mark Malhotra
    Shwetak Patel
    Xavi Prieto
    Daniel McDuff
    Ahmed Metwally
    (2025)
    Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information that encompasses lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety. Current evaluation practices for open-ended text responses heavily rely on human experts. This approach introduces human factors and is often cost-prohibitive and labor-intensive, and it hinders scalability, especially in complex domains like healthcare where response assessment necessitates domain expertise and considers multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubric questions. Our approach is based on recent work in more general evaluation settings that contrasts a smaller set of complex evaluation targets with a larger set of more precise, granular targets answerable with simple boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield higher inter-rater agreement among expert and non-expert human evaluators, and in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods. This enhanced efficiency, particularly in automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.
    Retrieval Augmented Generation (RAG) is a commonly used approach for enhancing LLMs with relevant and up-to-date information. However, the retrieved sources can often bring conflicting information and it is not clear how models address such discrepancies. In this work, we first point out that knowledge conflicts stem from various reasons and thus require tailored solutions in order to better align model responses to human preferences. To that end, we introduce a novel taxonomy of knowledge conflicts in RAG and define the desired model behavior for each category. Additionally, we construct a high-quality benchmark by asking two expert annotators to identify the conflict type within realistic RAG instances, each comprising a query and its associated search results. Finally, we conduct extensive experiments and show that explicitly informing LLMs about the potential conflict category significantly improves the quality and appropriateness of the responses. Yet, there is still substantial room for improvement. Taken together, our work highlights the importance of evaluating RAG systems not only on factual accuracy but also on their ability to manage and resolve knowledge conflicts effectively.