Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.


1 - 15 of 10822 publications
    Productionizing Quantum Mass Production
    Bill Huggins
    Nathan Wiebe
    arXiv for now (2026) (to appear)
    For many practical applications of quantum computing, the slowest and most costly steps involve coherently accessing classical data. We help address this challenge by applying mass production techniques, which can sometimes allow us to perform operations many times in parallel for a cost that is comparable to a single execution[1-3]. We combine existing mass-production results with modern approaches for loading classical data using "quantum read-only memory." We show that quantum mass production techniques offer no benefit when we consider a cost model that focuses purely on the number of non-Clifford gates. However, analyzing the constant factors in a more nuanced cost model, we find that it may be possible to obtain a reduction in cost of an order of magnitude or more for a variety of reasonably sized fault-tolerant quantum algorithms. We present several applications of quantum mass-production techniques beyond naive parallelization, including a strategy for reducing the cost of serial calls to the same data loading step.
    FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
    Diganta Misra
    Yanqi Luo
    Anjali Sridhar
    Justine Gehring
    Silvio Soares Ribeiro Junior
    2026
    AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative, but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals novel insights into the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization.
    Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs
    Bowen Tan
    Zheng Xu
    Eric Xing
    Zhiting Hu
    International Conference on Machine Learning (ICML) (2025)
    Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generators is effective, but is impractical when computation resources are limited. Meanwhile, prompt-based methods such as private evolution depend heavily on manual prompts, and ineffectively use private information in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach.
    Privacy-preserved LLM Cascade via CoT-enhanced Policy Learning
    Xiaozhong Liu
    Kai Zhang
    Congchao Wang
    Liqian Peng
    2025
    Large Language Models (LLMs) have attracted increasing attention in on-device applications due to their exceptional ability in real-world tasks. However, on-device LLMs often perform suboptimally due to hardware limitations. Cascading weaker local (on-device) LLMs with stronger server-side LLMs presents a promising solution to this challenge. While existing research on LLM cascades primarily focuses on optimizing the performance-cost trade-off, privacy concerns remain largely unaddressed. In this work, we prioritize privacy-preserved LLM cascading while enhancing cascade efficiency. To this end, we propose a novel CoT-enhanced policy learning strategy for deferral decision-making, which accounts for both performance-cost trade-offs and privacy considerations. Extensive experiments on three benchmark datasets validate the effectiveness and superiority of our approach.
    PageFlex: Flexible and Efficient User-space Delegation of Linux Paging Policies with eBPF
    Kan Wu
    Zhiyuan Guo
    Suli Yang
    Rajath Shashidhara
    Wei Xu
    Alex Snoeren
    Kim Keeton
    2025
    To increase platform memory efficiency, hyperscalers like Google and Meta transparently demote “cold” application data to cheaper cost-per-byte memory tiers like compressed memory and NVMe SSDs. These systems rely on standard kernel paging policies and mechanisms to maximize the achievable memory savings without hurting application performance. Although the literature promises better policies, implementing and deploying them within the Linux kernel is challenging. Delegating policies and mechanisms to user space, through userfaultfd or library-based approaches, incurs overheads and may require modifying application code. We present PageFlex, a framework for delegating Linux paging policies to user space with minimal overhead and full compatibility with existing real-world deployments. PageFlex uses eBPF to delegate policy decisions while providing low-overhead access to in-kernel memory state and access information, thus balancing flexibility and performance. Additionally, PageFlex supports different paging strategies for distinct memory regions and application phases. We show that PageFlex can delegate existing kernel-based policies with little (< 1%) application slowdown, effectively realizing the benefits of state-of-the-art policies like Hyperbolic caching and Leap prefetching, and unlocking application-specific benefits through region- and phase-aware policy specialization.
    Generative Artificial Intelligence (AI), particularly Large Language Models (LLMs), has demonstrated significant potential in clinical reasoning skills such as history-taking and differential diagnosis generation, which are critical aspects of medical education. This work explores how LLMs can augment medical curricula through interactive learning. We conducted a participatory design process with medical students, residents and medical education experts to co-create an AI-powered tutor prototype for clinical reasoning. As part of the co-design process, we conducted a qualitative user study, investigating learning needs and practices via interviews, and conducting concept evaluations through interactions with the prototype. Findings highlight the challenges learners face in transitioning from theoretical knowledge to practical application, and how an AI tutor can provide personalized practice and feedback. We conclude with design considerations, emphasizing the importance of context-specific knowledge and emulating positive preceptor traits, to guide the development of AI tools for medical education.
    Not Like Us, Hunty: Measuring Perceptions and Behavioral Effects of Minoritized Anthropomorphic Cues in LLMs
    Jeffrey Basoah
    Daniel Chechelnitsky
    Tao Long
    Katharina Reinecke
    Chrysoula Zerva
    Kaitlyn Zhou
    Maarten Sap
    Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, ACM (2025), pp. 710-745
    As large language models (LLMs) increasingly adapt and personalize to diverse sets of users, there is an increased risk of systems appropriating sociolects, i.e., language styles or dialects that are associated with specific minoritized lived experiences (e.g., African American English, Queer slang). In this work, we examine whether sociolect usage by an LLM agent affects user reliance on its outputs and user perception (satisfaction, frustration, trust, and social presence). We designed and conducted user studies where 498 African American English (AAE) speakers and 487 Queer slang speakers performed a set of question-answering tasks with LLM-based suggestions in either standard American English (SAE) or their self-identified sociolect. Our findings showed that sociolect usage by LLMs influenced both reliance and perceptions, though in some surprising ways. Results suggest that both AAE and Queer slang speakers relied more on the SAELM, and had more positive perceptions of the SAELM. Yet, only Queer slang speakers felt more social presence from the QSLM over the SAE one, whereas only AAE speakers preferred and trusted the SAELM over the AAE one. These findings emphasize the need to test for behavioral outcomes rather than simply assume that personalization would lead to a better and safer reliance outcome. They also highlight the nuanced dynamics of minoritized language in machine interactions, underscoring the need for LLMs to be carefully designed to respect cultural and linguistic boundaries while fostering genuine user engagement and trust.
    Fine-grained Measurement of Vehicle Delay Fairness
    Eliav Buchnik
    Tom Kalvari
    Jack Haddad
    Dan Karliner
    Danny Veikherman
    Ron Tsibulsky
    Shai Ferster
    Ori Rottenstreich
    2025
    Optimizing signal timing in traffic lights helps to improve traffic flow and reduce emissions by reducing delays. At intersections, vehicles from different movements experience different delays depending on the traffic light plan. This paper analyzes delay fairness among various vehicles at intersections. We study three cities, Rio de Janeiro, Hamburg and Seattle, with a total of over 5100 intersections. We present an intuitive methodology to compute delay fairness based on the Gini index, a common fairness measure in economics. We evaluate the fairness based on real traffic data and provide insights on the relationship of fairness with time of day and traffic demand. We also examine real changes in traffic light plans that occurred in practice to check whether improving delay is often aligned with increasing fairness.
    RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
    Aviv Slobodkin
    Hagai Taitelbaum
    Brian Gordon
    Michal Sokolik
    Almog Gueta
    Royi Rassin
    Dani Lischinski
    2025
    Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability - ranging from enhanced personalization in image generation to consistent character representation in video rendering - progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this gap, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single run. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or statistically matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving up to 6.4-point gains in textual alignment and 5.9-point gains in subject preservation.
    Many persistent and dangerous software vulnerabilities, including memory safety violations and code injection, arise from a common root cause: Developers inadvertently violate the implicit safety preconditions of widely-used programming constructs. These preconditions (such as pointer validity, array-access bounds, and the trustworthy provenance of code fragments to be evaluated as SQL, HTML, or JavaScript) are traditionally the developer's responsibility to ensure. In complex systems, meeting these obligations often relies on non-local, whole-program invariants that are notoriously difficult to reason about correctly, leading to vulnerabilities that are difficult to detect after the fact. This article introduces Safe Coding, a collection of software design patterns and practices designed to cost-effectively provide a high degree of assurance against entire classes of such vulnerabilities. The core principle of Safe Coding is to shift responsibility for safety from individual developers to the programming language, software libraries, and frameworks. This is achieved by systematically eliminating the direct use of risky operations (those with complex safety preconditions) in application code. Instead, these operations are encapsulated within safe abstractions: modules with public APIs that are safe by design, whose implementations fully ensure all module-internal safety preconditions through a combination of local runtime checks and by elevating safety preconditions into type invariants. Safe Coding facilitates a modular and compositional approach to whole-program safety: Difficult reasoning is localized to the implementation of safe abstractions, which undergo focused expert scrutiny. The composition of these abstractions with the majority of the codebase (which is kept free of risky operations) is then automatically verified by the language’s type checker. This form of compositional reasoning, drawing from patterns used in formal software verification, can be viewed as a semi-formal approach that balances rigor with broad applicability to large industrial codebases. We discuss the successful application of these practices at Google, where they have nearly eliminated vulnerabilities such as Cross-Site Scripting (XSS) and SQL injection, and their critical role in ensuring memory safety in Rust, collectively demonstrating a favorable cost-assurance tradeoff for achieving software safety at scale.
    Triaging mammography with artificial intelligence: an implementation study
    Sarah M. Friedewald
    Sunny Jansen
    Fereshteh Mahvar
    Timo Kohlberger
    David V. Schacht
    Sonya Bhole
    Dipti Gupta
    Scott Mayer McKinney
    Stacey Caron
    David Melnick
    Mozziyar Etemadi
    Samantha Winter
    Alejandra Maciel
    Luca Speroni
    Martha Sevenich
    Arnav Agharwal
    Rubin Zhang
    Gavin Duggan
    Shiro Kadowaki
    Atilla Kiraly
    Jie Yang
    Basil Mustafa
    Krish Eswaran
    Shravya Shetty
    Breast Cancer Research and Treatment (2025)
    Purpose: Many breast centers are unable to provide immediate results at the time of screening mammography, which results in delayed patient care. Implementing artificial intelligence (AI) could identify patients who may have breast cancer and accelerate the time to diagnostic imaging and biopsy diagnosis. Methods: In this prospective, randomized, unblinded, controlled implementation study we enrolled 1000 screening participants between March 2021 and May 2022. The experimental group used an AI system to prioritize a subset of cases for same-visit radiologist evaluation, and same-visit diagnostic workup if necessary. The control group followed the standard of care. The primary operational endpoints were time to additional imaging (TA) and time to biopsy diagnosis (TB). Results: The final cohort included 463 experimental and 392 control participants. The one-sided Mann-Whitney U test was employed for analysis of TA and TB. In the control group, the TA was 25.6 days [95% CI 22.0–29.9] and TB was 55.9 days [95% CI 45.5–69.6]. In comparison, the experimental group's mean TA was reduced by 25% (6.4 fewer days [one-sided 95% CI > 0.3], p<0.001) and mean TB was reduced by 30% (16.8 fewer days [95% CI > 5.1], p=0.003). The time reduction was more pronounced for AI-prioritized participants in the experimental group. All participants eventually diagnosed with breast cancer were prioritized by the AI. Conclusions: Implementing AI prioritization can accelerate care timelines for patients requiring additional workup, while maintaining the efficiency of delayed interpretation for most participants. Reducing diagnostic delays could contribute to improved patient adherence, decreased anxiety and addressing disparities in access to timely care.
    A Novel CI Coding Strategy Based on a Cochlear Model and Deep Neural Network
    Maryam Hosseini
    Tim Brochier
    Zachary Smith
    Brett Swanson
    Andrew Vandali
    Alan Kan
    Fadwa Alnafjan
    Kat Fernandez
    Conference on Implantable Auditory Prostheses 2025
    Objective: Many CI recipients face difficulties in understanding speech in noisy environments and express frustration with the quality of music. This may be partly due to the simple filter banks used in current CI technology, which do not fully replicate the natural processes of the cochlea. This project aims to improve CI perception by more accurately mimicking the responses of the auditory nerve. Method: Audio signals were applied to CARFAC (Cascade of Asymmetric Resonators with Fast-Acting Compression) [1] to produce a representation of the auditory nerve response, known as a normal hearing (NH) “neurogram”. The NH neurogram was down-sampled and applied to a deep neural network (DNN) to produce 22 electrode stimulation currents. These currents were applied to an electrical hearing (EH) model incorporating current spread, neural adaptation, and refractoriness, to produce a CI neurogram. The DNN was trained on sentences from the TIMIT database to minimise the difference between the NH and CI neurograms. Results: The CI neurograms produced by the CARFAC-DNN strategy were more similar to the NH neurograms than the CI neurograms produced by the Nucleus ACE strategy. Similarity was quantified by the structural similarity index and mean squared error. Conclusions: The CARFAC-DNN strategy may provide a more natural auditory nerve response than traditional CI sound coding strategies. A sound-booth study with CI recipients is planned. This work was funded by Google through the Australian Future Hearing Initiative. References: [1] Lyon, R. F. (2017). Human and machine hearing. Cambridge University Press.
    Perch is a performant pre-trained model for bioacoustics. It was trained in a supervised fashion, providing both off-the-shelf classification scores for thousands of vocalizing species as well as strong embeddings for transfer learning. In this new release, Perch 2.0, we expand from training exclusively on avian species to a large multi-taxa dataset. The model is trained with self-distillation using a prototype-learning classifier as well as a new source-prediction training criterion. Perch 2.0 obtains state-of-the-art performance on the BirdSet and BEANS benchmarks. It also outperforms specialized marine models on marine transfer learning tasks, despite having almost no marine training data. We present hypotheses as to why fine-grained species classification is a particularly robust pre-training task for bioacoustics.
    Large language models (LLMs), optimized through human feedback, have rapidly emerged as a leading paradigm for developing intelligent conversational assistants. However, despite their strong performance across many benchmarks, LLM-based agents might still lack conversational skills such as disambiguation -- when they are faced with ambiguity, they often overhedge or implicitly guess users' true intents rather than asking clarification questions. Under task-specific settings, high-quality conversation samples are often limited, constituting a bottleneck for LLMs' ability to learn optimal dialogue action policies. We propose Action-Based Contrastive Self-Training (ACT), a quasi-online preference optimization algorithm based on Direct Preference Optimization (DPO), that enables data-efficient dialogue policy learning in multi-turn conversation modeling. We demonstrate ACT's efficacy under data-efficient tuning scenarios, even when there is no action label available, using multiple real-world conversational tasks: tabular-grounded question-answering, machine reading comprehension, and AmbigSQL, a novel task for disambiguating information-seeking requests for complex SQL generation towards data analysis agents. Additionally, we propose evaluating LLMs' ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation. ACT demonstrates substantial conversation modeling improvements over standard tuning approaches like supervised fine-tuning and DPO.
    We consider the problem of auto-bidding in online advertising from the perspective of a single advertiser. The goal of the advertiser is to maximize their value under the Return-on-Spend (RoS) constraint, with performance measured in terms of regret against the optimal offline solution that knows all queries a priori. Importantly, the value of the item is unknown to the bidder ahead of time. The goal of the bidder is to quickly identify the optimal bid, while simultaneously satisfying budget and RoS constraints. Using a simple UCB-style algorithm, we provide the first result which achieves optimal regret and constraint violation for this problem.