Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

1 - 15 of 11058 publications
    Abstract: Semantic data models express high-level business concepts and metrics, capturing the business logic needed to query a database correctly. Most data modeling solutions are built as layers above SQL query engines, with bespoke query languages or APIs. The layered approach means that semantic models can’t be used directly in SQL queries. This paper focuses on an open problem in this space – can we define semantic models in SQL, and make them naturally queryable in SQL? In parallel, graph query is becoming increasingly popular, including in SQL. SQL/PGQ extends SQL with an embedded subset of the GQL graph query language, adding property graph views and making graph traversal queries easy. We explore a surprising connection: semantic data models are graphs, and defining graphs is a data modeling problem. In both domains, users start by defining a graph model, and need query language support to easily traverse edges in the graph, which means doing joins in the underlying data. We propose some useful SQL extensions that make it easier to use higher-level data model abstractions in queries. Users can define a “semantic data graph” view of their data, encapsulating the complex business logic required to query the underlying tables correctly. Then they can query that semantic graph model easily with SQL. Our SQL extensions are useful independently, simplifying many queries – particularly, queries with joins. We make declared foreign key relationships usable for joins at query time – a feature that seems obvious but is notably missing in standard SQL. In combination, these extensions provide a practical approach to extend SQL incrementally, bringing semantic modeling and graph query together with the relational model and SQL.
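    To make the foreign-key point concrete, here is a minimal Python/SQLite sketch (hypothetical customers/orders schema; not the paper's proposed syntax) of the join that standard SQL requires today, restating a relationship the schema already declares:

```python
# Hypothetical illustration (not the paper's proposed extensions): even though a foreign
# key from orders.customer_id to customers.id is declared, standard SQL still requires
# the join condition to be spelled out at query time. The abstract's extensions aim to
# let such declared relationships imply the join.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),  -- declared relationship
        total REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 2, 25.0);
""")

# The ON clause below restates information already captured by the foreign key.
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name
""").fetchall()
print(rows)
```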
    Productionizing Quantum Mass Production
    Bill Huggins
    Nathan Wiebe
    arXiv (2026) (to appear)
    Abstract: For many practical applications of quantum computing, the slowest and most costly steps involve coherently accessing classical data. We help address this challenge by applying mass production techniques, which can sometimes allow us to perform operations many times in parallel for a cost that is comparable to a single execution [1-3]. We combine existing mass-production results with modern approaches for loading classical data using "quantum read-only memory." We show that quantum mass production techniques offer no benefit when we consider a cost model that focuses purely on the number of non-Clifford gates. However, analyzing the constant factors in a more nuanced cost model, we find that it may be possible to obtain a reduction in cost of an order of magnitude or more for a variety of reasonably sized fault-tolerant quantum algorithms. We present several applications of quantum mass-production techniques beyond naive parallelization, including a strategy for reducing the cost of serial calls to the same data loading step.
    CrossCheck: Input Validation for WAN Control Systems
    Rishabh Iyer
    Isaac Keslassy
    Sylvia Ratnasamy
    Networked Systems Design and Implementation (NSDI) (2026) (to appear)
    Abstract: We present CrossCheck, a system that validates inputs to the Software-Defined Networking (SDN) controller in a Wide Area Network (WAN). By detecting incorrect inputs—often stemming from bugs in the SDN control infrastructure—CrossCheck alerts operators before they trigger network outages. Our analysis at a large-scale WAN operator identifies invalid inputs as a leading cause of major outages, and we show how CrossCheck would have prevented those incidents. We deployed CrossCheck as a shadow validation system for four weeks in a production WAN, during which it accurately detected the single incident of invalid inputs that occurred while sustaining a 0% false positive rate under normal operation, hence imposing little additional burden on operators. In addition, we show through simulation that CrossCheck reliably detects a wide range of invalid inputs (e.g., detecting demand perturbations as small as 5% with 100% accuracy) and maintains a near-zero false positive rate for realistic levels of noisy, missing, or buggy telemetry data (e.g., sustaining zero false positives with up to 30% of corrupted telemetry data).
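    As a rough illustration of the kind of shadow check described (not CrossCheck's actual algorithm; the function, flow names, and tolerance below are hypothetical), a validator might flag controller inputs whose demands deviate from telemetry-derived estimates by more than a small tolerance:

```python
# Hypothetical sketch in the spirit of the abstract (not CrossCheck's algorithm):
# compare controller input demands against telemetry-derived estimates and alert
# on large relative deviations.
def validate_demands(input_demands, telemetry_demands, tolerance=0.05):
    """Return the flow IDs whose input demand deviates from telemetry by more than tolerance."""
    suspect = []
    for flow, claimed in input_demands.items():
        observed = telemetry_demands.get(flow)
        if observed is None:
            continue  # missing telemetry: leave to other checks rather than alert here
        if observed > 0 and abs(claimed - observed) / observed > tolerance:
            suspect.append(flow)
    return suspect

alerts = validate_demands({"A->B": 120.0, "A->C": 80.0},
                          {"A->B": 100.0, "A->C": 81.0})
print(alerts)  # ['A->B']: a ~20% perturbation exceeds the 5% tolerance
```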
    FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
    Diganta Misra
    Yanqi Luo
    Anjali Sridhar
    Justine Gehring
    Silvio Soares Ribeiro Junior
    2026
    Abstract: AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative, but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization.
    Abstract: How many T gates are needed to approximate an arbitrary n-qubit quantum state to within a given precision ε? Improving prior work of Low, Kliuchnikov and Schaeffer, we show that the optimal asymptotic scaling is Θ(√(2^n log(1/ε)) + log(1/ε)) if we allow an unlimited number of ancilla qubits. We also show that this is the optimal T-count for implementing an arbitrary diagonal n-qubit unitary to within error ε. We describe an application to batched synthesis of single-qubit unitaries: we can approximate a tensor product of m = O(log log(1/ε)) arbitrary single-qubit unitaries to within error ε with the same asymptotic T-count as is required to approximate just one single-qubit unitary.
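    A small sketch of the stated scaling, evaluating only the expression inside Θ(...) with all constant factors omitted, for a few example values of n and ε:

```python
# Back-of-the-envelope reading of the stated asymptotic T-count,
# Theta(sqrt(2^n * log(1/eps)) + log(1/eps)), ignoring the hidden constant factors.
import math

def t_count_scaling(n, eps):
    """Value of the expression inside Theta(...) for n qubits and target error eps."""
    log_term = math.log2(1.0 / eps)
    return math.sqrt(2**n * log_term) + log_term

for n in (10, 20, 30):
    print(n, f"{t_count_scaling(n, 1e-10):.3e}")
```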
    Quasiparticle-induced decoherence of a driven superconducting qubit
    Mykola Kishmar
    Pavel Kurilovich
    Vlad Kurilovich
    Thomas Connolly
    Andrey Klots
    Igor Aleiner
    arXiv (2025)
    Abstract: We develop a theory for two quasiparticle-induced decoherence mechanisms of a driven superconducting qubit. In the first mechanism, an existing quasiparticle (QP) tunnels across the qubit’s Josephson junction while simultaneously absorbing a qubit excitation and one (or several) photons from the drive. In the second mechanism, a qubit transition occurs during the non-linear absorption process converting multiple drive quanta into a pair of new QPs. Both mechanisms can remain significant in gap-engineered qubits whose coherence is insensitive to QPs without the drive. Our theory establishes a fundamental limitation, stemming from QPs, on the fidelity of microwave qubit operations such as readout and gates.
    Abstract: Despite the advent of legislation such as the General Data Protection Regulation (GDPR) with its associated "Right to be Forgotten" (RTBF), few, if any, studies have measured user reactions to realistic edge cases with public-interest content. Surveying both users covered by and excluded from RTBF, this vignette-based survey experiment sought to better understand how users think of delisting content from search engine results and what factors influence user perceptions. While leaving information accessible in search engine results generally leads to warmer feelings towards those search engines than delisting it, we find that users do prefer different outcomes depending on contextual elements specific to given cases. We also find that whether a country has active RTBF legislation does seem to be associated with both knowledge and attitudes about RTBF, but is unlikely to explain all of it. These results indicate a complex context around removing public-interest content from search engines’ results; it is essential that experts sensitive to local context perform the review in order to ensure that removal requests are handled in a way that meets users’ expectations.
    Inside-Out: Hidden Factual Knowledge in LLMs
    Eyal Ben David
    Eran Ofek
    Hadas Orgad
    Zorik Gekhman
    Roi Reichart
    Yonatan Belinkov
    2025
    Abstract: This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. While a few studies hint at this possibility, none has clearly defined or demonstrated this phenomenon. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher. This gives rise to external and internal knowledge, depending on the information used to score individual answer candidates: either the model’s observable token-level probabilities or its intermediate computations. Hidden knowledge arises when internal knowledge exceeds external knowledge. We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup. Our results indicate that: (1) LLMs consistently encode more factual knowledge internally than what they express externally, with an average gap of 40%. (2) Surprisingly, some knowledge is so deeply hidden that a model can internally know an answer perfectly, yet fail to generate it even once, despite large-scale repeated sampling of 1,000 answers. This reveals fundamental limitations in the generation capabilities of LLMs, which (3) puts a practical constraint on scaling test-time compute via repeated answer sampling in closed-book QA: significant performance improvements remain inaccessible because some answers are practically never sampled, yet if they were, we would be guaranteed to rank them first.
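    A minimal sketch of the knowledge measure as defined above: for one question, the fraction of (correct, incorrect) answer pairs in which the correct answer receives the higher score (the scores below are hypothetical placeholders for whichever scoring function is used):

```python
# Minimal sketch of the knowledge measure described in the abstract: for one question,
# the fraction of (correct, incorrect) answer pairs in which the correct answer is
# scored higher. "External" scoring would use observable token-level probabilities;
# "internal" scoring would use intermediate computations (e.g. a probe).
def knowledge_score(correct_scores, incorrect_scores):
    wins = sum(c > i for c in correct_scores for i in incorrect_scores)
    return wins / (len(correct_scores) * len(incorrect_scores))

external = knowledge_score([0.30, 0.25], [0.40, 0.10, 0.05])   # hypothetical sequence probabilities
internal = knowledge_score([0.80, 0.75], [0.60, 0.20, 0.10])   # hypothetical probe scores
print(external, internal)  # hidden knowledge arises when internal exceeds external
```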
    Abstract: Building on the linear programming approach to competitive equilibrium pricing, we develop a general method for constructing iterative auctions that achieve Vickrey-Clarke-Groves (VCG) outcomes. We show how to transform a linear program characterizing competitive equilibrium prices into one that characterizes universal competitive equilibrium (UCE) prices, which elicit precisely the information needed to compute VCG payments. By applying a primal-dual algorithm to these transformed programs, we derive iterative auctions that maintain a single price path, eliminating the overhead and incentive problems associated with multiple price paths used solely for payment calculations. We demonstrate the versatility of our method by developing novel UCE auctions for multi-unit settings and deriving an iterative UCE variant of the Product-Mix auction. The resulting auctions combine the transparency of iterative price discovery with the efficiency and incentive properties of the VCG mechanism.
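    For reference, this is the VCG payment rule that such auctions aim to implement, shown in the simplest single-item case where it reduces to a second-price auction (a generic illustration, not the paper's LP or primal-dual construction):

```python
# Generic illustration of the VCG payment rule (not the paper's construction), in the
# single-item case: the winner pays the welfare the others lose by its presence,
# which here is simply the second-highest bid.
def vcg_single_item(bids):
    """bids: dict bidder -> value. Returns (winner, payment)."""
    winner = max(bids, key=bids.get)
    # Welfare of the others if the winner were absent, minus their welfare when the
    # winner is present (zero, since the winner takes the item).
    others = [v for b, v in bids.items() if b != winner]
    payment = max(others) if others else 0.0
    return winner, payment

print(vcg_single_item({"alice": 10.0, "bob": 7.0, "carol": 4.0}))  # ('alice', 7.0)
```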
    VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis
    Enric Corona
    Andrei Zanfir
    Cristian Sminchisescu
    Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) (2025)
    Abstract: We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3D-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high-quality video of variable length, easily controllable through text or speech via high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate. We also curate MENTOR, a new and diverse dataset with 3D pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures, on which we train and ablate our main technical contributions. VLOGGER outperforms state-of-the-art methods on three public benchmarks, considering image quality, identity preservation and temporal consistency, while also generating upper-body gestures. We analyze the performance of VLOGGER with respect to multiple diversity metrics, showing that our architectural choices and the use of MENTOR benefit training a fair and unbiased model at scale. Finally, we show applications in video editing and personalization.
    Scaling Laws for Downstream Task Performance in Machine Translation
    Natalia Ponomareva
    Hussein Hazimeh
    Sanmi Koyejo
    International Conference on Learning Representations (ICLR) (2025) (to appear)
    Abstract: Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in which LLMs are pretrained on an unsupervised dataset and then finetuned on a downstream task, we often also care about the downstream performance. In this work, we study the scaling behavior in a transfer learning setting, where LLMs are finetuned for machine translation tasks. Specifically, we investigate how the choice of the pretraining data and its size affect downstream performance (translation quality), as judged by downstream cross-entropy and by translation quality metrics such as BLEU and COMET scores. Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and translation quality scores improve monotonically with more pretraining data. In such cases, we show that it is possible to predict the downstream translation quality metrics with good accuracy using a log-law. However, there are cases where moderate misalignment causes the downstream translation scores to fluctuate or get worse with more pretraining, whereas downstream cross-entropy monotonically improves. By analyzing these cases, we provide new practical insights for choosing appropriate pretraining data.
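    An illustrative sketch of predicting a downstream metric with a log-law in the spirit of the abstract (the data points and the exact parametric form below are assumptions, not taken from the paper): fit quality ≈ a + b·log(pretraining size) on a few measurements, then extrapolate:

```python
# Illustrative log-law fit (hypothetical numbers; the paper's exact functional form and
# fitting procedure may differ): fit quality ~ a + b * log(pretraining_tokens) on a few
# measured points, then extrapolate to a larger pretraining budget.
import numpy as np

tokens  = np.array([1e9, 3e9, 1e10, 3e10])      # hypothetical pretraining data sizes
quality = np.array([22.0, 25.1, 28.3, 31.2])    # hypothetical BLEU-like scores

b, a = np.polyfit(np.log(tokens), quality, deg=1)   # least-squares fit of the log-law
predicted = a + b * np.log(1e11)
print(f"predicted quality at 1e11 tokens: {predicted:.1f}")
```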
    Abstract: Specific quantum algorithms exist that could, in theory, break elliptic curve cryptographic protocols. Implementing these algorithms requires designing quantum circuits that perform elliptic curve arithmetic. To accurately judge a cryptographic protocol’s resistance against future quantum computers, researchers work out minimal resource-count circuits that still perform these operations correctly. To ensure the correctness of a circuit, it is essential to restore all ancilla qubits used to their original states; failure to do so could result in decoherence of the computation’s final result. Through rigorous classical simulation and unit testing, I surfaced four inconsistencies in the state-of-the-art quantum circuit for elliptic curve point addition where the circuit diagram states the qubits are returned in the original (|0⟩) state, but the intermediate values are not uncomputed. I provide fixes to the circuit without increasing the leading-order gate cost.
    Abstract: Integrating tools like Code Interpreter and Search has significantly improved the reasoning of Large Language Models (LLMs), as shown by leading models such as OpenAI's ChatGPT Agent, Google's Gemini-Pro, and XAI's Grok4. However, the research community still lacks practical guidance on fully leveraging these tools. The main challenge lies in finding an effective way to combine the benefits of textual reasoning, coding, and searching across different kinds of questions. To address this, we propose an ensemble-based framework that runs multiple agents in parallel, each exploring a different answer path with a distinct tool-use strategy. Agents iteratively share and refine their answers by considering the original question and previous responses. Our proposed method, Tool-Use Mixture (TUMIX), achieves significant gains over other representative tool-augmented test-time scaling methods such as Self-MoA, Symbolic-MoE, DEI, SciMaster, and GSA. With near-equal inference costs, TUMIX delivers an average +3.55% accuracy improvement over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks (HLE, GPQA, AIME 24&25), where coding and search can effectively support reasoning when applied properly. We find that agent diversity and quality are crucial, and can be further improved by querying LLMs to automatically optimize agent designs. To reduce costs, TUMIX halts refinement once sufficient confidence is reached, preserving nearly the same performance at just 49% of the inference cost. With further scaling, TUMIX can achieve even higher performance, though at substantially greater cost.
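    A schematic sketch of such an ensemble loop (hypothetical agent interface; not the TUMIX implementation): agents with different tool-use strategies answer in parallel, iteratively refine while seeing each other's previous answers, and stop early once agreement is high enough:

```python
# Schematic sketch of the ensemble loop described in the abstract (hypothetical agent
# interface; not the released TUMIX code): agents answer, then iteratively refine while
# seeing previous answers, stopping early once the answers agree strongly enough.
from collections import Counter

def run_ensemble(question, agents, max_rounds=3, confidence=0.8):
    answers = [agent(question, previous=None) for agent in agents]
    for _ in range(max_rounds):
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= confidence:     # early stop to save inference cost
            return top
        answers = [agent(question, previous=answers) for agent in agents]
    return Counter(answers).most_common(1)[0][0]

# Usage: `agents` would be callables wrapping, e.g., text-only, code-executing, and
# search-augmented prompting strategies over the same underlying LLM.
```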
    Abstract: Background: Providers spend a large percentage of their day using electronic health record (EHR) technology and frequently report frustration when EHR tasks are time-consuming and effortful. To solve these challenges, artificial intelligence (AI)–based enhancements to EHR technology are increasingly being deployed. However, AI-based implementations for EHR features often lack user-centered evaluation. Objective: This study evaluates, using a user-centered approach, the implementation of an AI-powered search and clinical discovery tool within an EHR system. Methods: We conducted a mixed methods study consisting of interviews, observations, and surveys for 5 months. Results: High adoption rates for the AI-based features (163/176, 93% users after 3 months) and significant increases across key metrics, including user satisfaction (U=49; P<.001) and perception of time saved (U=49; P<.001), demonstrated that the AI-based features were not only successfully integrated into various clinical workflows but also improved the user experience for clinicians. Conclusions: Our results underscore the feasibility and effectiveness of using a user-centered approach for the deployment of clinical AI tools. High adoption rates and positive user experiences were driven by our user-centered research program, which emphasized close collaboration with users, rapid incorporation of feedback, and tailored user training. This study program can be used as a starting framework for the design and integration of human-centered research methods for AI tool deployment in clinical settings.
    Parallelising Lazy Clause Generation with Trail Sharing
    Toby Davies
    Peter J Stuckey
    Integration of Constraint Programming, Artificial Intelligence, and Operations Research (CPAIOR 2025), Springer Nature Switzerland, Cham, pp. 205-221
    Abstract: We investigate the effectiveness of splitting the search space to parallelise the state-of-the-art CP-SAT solver. One of the key barriers to effective search-space splitting in learning solvers is that the generated sub-problems are not independent, leading to substantial communication-related overhead, substantial redundant work, or both. We present two contributions that attempt to mitigate this issue: job reassignment and trail sharing. Jobs (sub-trees) are reassigned to new workers if the clause database of the currently assigned worker appears ill-suited to that region of the search space; when this happens, workers can share some of the state from their local trail. We argue that a trail prefix can be viewed as a lossy, compressed representation of much of the relevant information a worker has learnt about a job; this information can be exploited by the next worker assigned the same subtree to avoid redundant work. We show these approaches complement standard portfolio and clause-sharing approaches, improving CP-SAT’s performance on MiniZinc challenge benchmarks with a moderate number of worker threads. To enable these approaches, we also introduce “Buffered Work Stealing,” which can be parameterised to emulate the two main existing approaches to search-space splitting in the literature, Work Stealing and Embarrassingly Parallel Search, as well as an intermediate configuration between these two extremes that slightly outperforms both.
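    A toy sketch of the trail-sharing idea (hypothetical data structures, not the CP-SAT implementation): a job released by one worker carries the trail prefix already explored, so the next worker can replay those decisions rather than rediscover them:

```python
# Toy sketch of trail sharing under job reassignment (hypothetical data structures,
# not the CP-SAT implementation): a released job carries the trail prefix (decisions
# made so far), which the next worker replays instead of re-searching.
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Job:
    subtree_root: str
    trail_prefix: list = field(default_factory=list)  # decisions made so far

work_buffer = deque()

def release_job(job, local_trail):
    """A worker gives up a job; attach its current trail prefix for the next worker."""
    work_buffer.append(Job(job.subtree_root, list(local_trail)))

def take_job():
    job = work_buffer.popleft()
    # The new worker replays job.trail_prefix before continuing the search,
    # recovering cheaply much of what the previous worker had already learnt.
    return job
```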