Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

1 - 15 of 10827 publications
    Productionizing Quantum Mass Production
    Bill Huggins
    Nathan Wiebe
    arXiv for now (2026) (to appear)
    Preview abstract For many practical applications of quantum computing, the slowest and most costly steps involve coherently accessing classical data. We help address this challenge by applying mass production techniques, which can sometimes allow us to perform operations many times in parallel for a cost that is comparable to a single execution[1-3]. We combine existing mass-production results with modern approaches for loading classical data using "quantum read-only memory." We show that quantum mass production techniques offer no benefit when we consider a cost model that focuses purely on the number of non-Clifford gates. However, analyzing the constant factors in a more nuanced cost model, we find that it may be possible to obtain a reduction in cost of an order of magnitude or more for a variety of reasonably sized fault-tolerant quantum algorithms. We present several applications of quantum mass-production techniques beyond naive parallelization, including a strategy for reducing the cost of serial calls to the same data loading step. View details
    FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration
    Diganta Misra
    Yanqi Luo
    Anjali Sridhar
    Justine Gehring
    Silvio Soares Ribeiro Junior
    2026
    Preview abstract AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative—but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals novel insights into the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization. View details
    “Does the cafe entrance look accessible? Where is the door?” Towards Geospatial AI Agents for Visual Inquiries
    Jared Hwang
    Zeyu Wang
    John S. O'Meara
    Xia Su
    William Huang
    Yang Zhang
    Alex Fiannaca
    ICCV'25 Workshop "Vision Foundation Models and Generative AI for Accessibility: Challenges and Opportunities" (2025)
    Preview abstract Interactive digital maps have revolutionized how people travel and learn about the world; however, they rely on preexisting structured data in GIS databases (e.g., road networks, POI indices), limiting their ability to address geovisual questions related to what the world looks like. We introduce our vision for Geo-Visual Agents—multimodal AI agents capable of understanding and responding to nuanced visual-spatial inquiries about the world by analyzing large-scale repositories of geospatial images, including streetscapes (e.g., Google Street View), place-based photos (e.g., TripAdvisor, Yelp), and aerial imagery (e.g., satellite photos) combined with traditional GIS data sources. We define our vision, describe sensing and interaction approaches, provide three exemplars, and enumerate key challenges and opportunities for future work. View details
    Correspondence: Wearing a Fur Coat in the Summertime: Should Digital Pathology Redefine Medical Imaging?
    Kenneth Philbrick
    Brian Napora
    John Groth
    Mustafa Yousuf
    Journal of Pathology Informatics (2025)
    Preview abstract In response to recent critiques, members of DICOM Working Group 26 assert that DICOM is the robust and essential standard for digital pathology, actively facilitating interoperability and communication of medical images far beyond simple pixel data. They highlight successful global deployments and collaborations (like the recent Connectathon) demonstrating DICOM's proven ability to integrate WSI scanners, archives, viewers, and AI tools. Despite concerns, DICOM offers flexible metadata encoding, robust security features, and strong industry and regulatory support, making it indispensable for patient care. The authors advocate for continued investment in and adoption of DICOM to advance efficiency, accuracy, and patient safety in integrated healthcare systems. View details
    Preview abstract Recent advancements in Large Language Models (LLMs) have created new opportunities to enhance performance on complex reasoning tasks by leveraging test-time computation. However, existing scaling methods have key limitations: parallel methods like repeated sampling are often inefficient and quickly saturate, while sequential methods like SELF-REFINE struggle to improve after a few rounds. Although combining these approaches shows promise, current methods require fine-tuned reward and revision models. This paper proposes Self-Enhanced Test-Time Scaling (SETS), a simple yet effective approach that overcomes these limitations by strategically combining parallel and sequential techniques and fully leveraging LLMs' self-improvement abilities. SETS exploits the inherent self-verification and self-correction capabilities of LLMs, unifying sampling, verification, and correction within a single framework. This facilitates efficient and scalable test-time computation for enhanced performance on complex tasks without any model training. Our comprehensive experimental results on challenging benchmarks spanning planning, reasoning, math, and coding demonstrate that SETS achieves significant performance improvements and more advantageous test-time scaling behavior than the alternatives. View details
    Preview abstract Delay monitoring is a commonly arising problem in applications such as queue management systems, scheduling, and traffic monitoring. Motivated by such applications, we formulate a queue monitoring problem, where there is a FIFO queue with arbitrary arrivals and departures, and a server needs to monitor the length of a queue by using (decentralized) pings from packets in the queue. Packets can send pings informing the server about the number of packets ahead of them in the queue. Via novel online algorithms and lower bounds, we tightly characterize the trade-off between the number of pings sent and the accuracy of the server's real time estimates. Further, our approximate estimates can be made to be accurate to an arbitrary precision. View details
    Preview abstract This paper adopts a Usage-Based Construction Grammar perspective to compare human- and AI-generated language, focusing on Verb-Argument Constructions (VACs) as a lens for analysis. Specifically, we examine solicited advice texts in two domains—Finance and Medicine—produced by humans and ChatGPT across different GPT models (3.5, 4, and 4o) and interfaces (3.5 Web vs. 3.5 API). Our findings reveal broad consistency in the frequency and distribution of the most common VACs across human- and AI-generated texts, though ChatGPT exhibits a slightly higher reliance on the most frequent constructions. A closer examination of the verbs occupying these constructions uncovers significant differences in the meanings conveyed, with a notable growth away from human-like language production in macro-level perspectives (e.g., length) and towards human-like verb-VAC patterns with newer models. These results underscore the potential of VACs as a powerful tool for analyzing AI-generated language and tracking its evolution over time. View details
    Preview abstract AI models have been proposed for hypothesis generation, but testing their ability to drive high-impact research is challenging, since an AI-generated hypothesis can take decades to validate. Here, we challenge the ability of a recently developed LLM-based platform to generate high-level hypotheses by posing a question that took years to resolve experimentally but remained unpublished: How could capsid-forming phage-inducible chromosomal islands (cf-PICIs) spread across bacterial species? Remarkably, the AI’s top-ranked hypothesis matched our experimentally confirmed mechanism: cf-PICIs hijack diverse phage tails to expand their host range. We critically assess the AI’s five highest-ranked hypotheses, showing that some opened new research avenues in our laboratories. We benchmark its performance against other LLMs and outline best practices for integrating AI into scientific discovery. Our findings suggest that AI can act not just as a computational tool, but as a creative engine, accelerating discovery and reshaping how we generate and test scientific hypotheses. View details
    Not Like Us, Hunty: Measuring Perceptions and Behavioral Effects of Minoritized Anthropomorphic Cues in LLMs
    Jeffrey Basoah
    Daniel Chechelnitsky
    Tao Long
    Katharina Reinecke
    Chrysoula Zerva
    Kaitlyn Zhou
    Maarten Sap
    Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, ACM (2025), pp. 710-745
    Preview abstract As large language models (LLMs) increasingly adapt and personalize to diverse sets of users, there is an increased risk of systems appropriating sociolects, i.e., language styles or dialects that are associated with specific minoritized lived experiences (e.g., African American English, Queer slang). In this work, we examine whether sociolect usage by an LLM agent affects user reliance on its outputs and user perception (satisfaction, frustration, trust, and social presence). We designed and conducted user studies where 498 African American English (AAE) speakers and 487 Queer slang speakers performed a set of question-answering tasks with LLM-based suggestions in either standard American English (SAE) or their self-identified sociolect. Our findings showed that sociolect usage by LLMs influenced both reliance and perceptions, though in some surprising ways. Results suggest that both AAE and Queer slang speakers relied more on the SAELM, and had more positive perceptions of the SAELM. Yet, only Queer slang speakers felt more social presence from the QSLM over the SAE one, whereas only AAE speakers preferred and trusted the SAELM over the AAE one. These findings emphasize the need to test for behavioral outcomes rather than simply assume that personalization would lead to a better and safer reliance outcome. They also highlight the nuanced dynamics of minoritized language in machine interactions, underscoring the need for LLMs to be carefully designed to respect cultural and linguistic boundaries while fostering genuine user engagement and trust. View details
    Inside-Out: Hidden Factual Knowledge in LLMs
    Eyal Ben David
    Eran Ofek
    Hadas Orgad
    Zorik Gekhman
    Roi Reichart
    Yonatan Belinkov
    2025
    Preview abstract This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. While a few studies hint at this possibility, none has clearly defined or demonstrated this phenomenon. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher. This gives rise to external and internal knowledge, depending on the information used to score individual answer candidates: either the model’s observable token-level probabilities or its intermediate computations. Hidden knowledge arises when internal knowledge exceeds external knowledge. We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup. Our results indicate that: (1) LLMs consistently encode more factual knowledge internally than what they express externally, with an average gap of 40%. (2) Surprisingly, some knowledge is so deeply hidden that a model can internally know an answer perfectly, yet fail to generate it even once, despite large-scale repeated sampling of 1,000 answers. This reveals fundamental limitations in the generation capabilities of LLMs, which (3) puts a practical constraint on scaling test-time compute via repeated answer sampling in closed-book QA: significant performance improvements remain inaccessible because some answers are practically never sampled, yet if they were, we would be guaranteed to rank them first. View details
    Preview abstract Augmenting LLMs with context leads to improved performance across many applications. Despite much research on Retrieval Augmented Generation (RAG) systems, an open question is whether errors arise because LLMs fail to utilize the context from retrieval or the context itself is insufficient to answer the query. To shed light on this, we develop a new notion of sufficient context, along with a way to classify instances that have enough information to answer the query. We then use sufficient context to analyze several models and datasets. By stratifying errors based on context sufficiency, we find that proprietary LLMs (Gemini, GPT, Claude) excel at answering queries when the context is sufficient, but often output incorrect answers instead of abstaining when the context is not. On the other hand, open-source LLMs (Llama, Mistral, Gemma) hallucinate or abstain often, even with sufficient context. We further categorize cases when the context is useful, and improves accuracy, even though it does not fully answer the query and the model errs without the context. Building on our findings, we explore ways to reduce hallucinations in RAG systems, including a new selective generation method that leverages sufficient context information for guided abstention. Our method improves the fraction of correct answers among times where the model responds by 2–10% for Gemini, GPT, and Gemma. View details
    The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning
    Pratik Fegade
    Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (2025), pp. 5-17
    Preview abstract A chief enabler of large-scale deep learning is the distribution of computation across multiple interconnected hardware accelerators. In order to unlock the maximum possible performance, a compiler must first select a reasonable strategy to parallelize a model's operations. Since neural network architectures admit multiple flavors of parallelism, determining the proper strategy for each instruction is a critical (albeit non-trivial) task. To solicit new ideas toward solving this challenging combinatorial optimization problem, we organized the ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning, a multi-month competition focused on advancing the state-of-the-art for model partitioning algorithms. In this paper, we offer a retrospective of this event, including the basic problem formulation, key challenges & opportunities, our new benchmark suite, and the quality of submissions received. View details
    Preview abstract Creativity in software development is frequently overlooked, specifically in the design of developer tools which often focus on productivity. This is likely because creativity is not always seen as a goal in software engineering; in part, this can be explained by the unique way in which software engineers relate to creativity as centered around reusability rather than novelty. However, creativity is a critical aspect of software engineering, and importantly, there is a clear possibility for AI to impact creativity in both positive and negative ways. In this article, we explore the differences in goals for designing AI tools for productivity compared to creativity and propose strategies to elevate creativity in the software engineering workflow. Specifically, we apply seamful design to AI-powered software development to consider the role of seamfulness in software development workflows as a way to support creativity. View details
    Preview abstract As one of the world's most populous countries, with 700 languages spoken, Indonesia is behind in terms of NLP progress. We introduce Lorax, a benchmark that focuses on low-resource languages of Indonesia and covers 6 diverse tasks: reading comprehension, open-domain QA, language inference, causal reasoning, translation, and cultural QA. We cover 20 languages, with the addition of 2 politeness registers for 3 of the languages. As a benchmark is essential to the progress itself, this data should provide a useful contribution to the community. We benchmark a diverse set of multilingual and region-focused LLMs and find that this benchmark is challenging. We note a visible discrepancy between performance in Indonesian and other languages, especially the low-resource ones. There is no clear lead when using a region-specific model as opposed to the general multilingual model. Lastly, we show that a change in register affects model performance, especially with registers not commonly found in social media, such as high-level politeness 'Krama' Javanese. View details
    Preview abstract We present new efficient algorithms for high-dimensional calibration via reduction to the TreeSwap algorithm of Dagan et al. View details