Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

1 - 15 of 10456 publications
Necro-reaper: Pruning away Dead Memory Traffic in Warehouse-Scale Computers
Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Association for Computing Machinery (2025)
Memory bandwidth is emerging as a critical bottleneck in warehouse-scale computing (WSC). This work reveals that a significant portion of memory traffic in WSC is unnecessary, consisting of writebacks of deallocated data and fetches of uninitialized data. This issue is particularly acute in WSC, where short-lived heap allocations bigger than a cache line are prevalent. To address this problem, this work proposes a pragmatic approach tailored to WSC. Leveraging the existing WSC ecosystem of vertical integration, profile-guided compilation flows, and customized memory allocators, this work presents Necro-reaper, a novel software/hardware co-design that avoids dead memory traffic without requiring the hardware tracking of prior work. New ISA instructions enable the hardware to avoid unnecessary dead traffic, while extended software components, including a profile-guided compiler and memory allocator, optimize the utilization of these instructions. Evaluation across a diverse set of 10 WSC workloads demonstrates that Necro-reaper achieves a geomean memory traffic reduction of 26% and a geomean IPC increase of 6%.
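For readers unfamiliar with the "geomean" figures quoted in this abstract, the following is a minimal sketch of how a geometric-mean reduction across workloads is commonly computed. The function name and the traffic numbers are hypothetical illustrations, not data or tooling from the paper, and the paper's exact averaging procedure may differ.

```python
from statistics import geometric_mean

def geomean_reduction(baseline, optimized):
    """Geometric-mean reduction, computed over per-workload optimized/baseline ratios."""
    if len(baseline) != len(optimized):
        raise ValueError("need one optimized value per baseline value")
    ratios = [o / b for b, o in zip(baseline, optimized)]
    return 1.0 - geometric_mean(ratios)

# Hypothetical memory-traffic counts (arbitrary units), one pair per workload.
baseline = [100.0, 200.0, 400.0]
optimized = [50.0, 100.0, 200.0]
reduction = geomean_reduction(baseline, optimized)
```

Averaging ratios geometrically keeps a single workload with a very large absolute traffic volume from dominating the summary figure.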
Oculomics: Current Concepts and Evidence
Zhuoting Zhu
Yueye Wang
Ziyi Qi
Wenyi Hu
Xiayin Zhang
Siegfried Wagner
Yujie Wang
An Ran Ran
Joshua Ong
Ethan Waisberg
Mouayad Masalkhi
Alex Suh
Yih Chung Tham
Carol Y. Cheung
Xiaohong Yang
Honghua Yu
Zongyuan Ge
Wei Wang
Bin Sheng
Andrew G. Lee
Alastair Denniston
Peter van Wijngaarden
Pearse Keane
Ching-Yu Cheng
Mingguang He
Tien Yin Wong
Progress in Retinal and Eye Research (2025)
The eye provides novel insights into general health, as well as pathogenesis and development of systemic diseases. In the past decade, growing evidence has demonstrated that the eye's structure and function mirror multiple systemic health conditions, especially in cardiovascular diseases, neurodegenerative disorders, and kidney impairments. This has given rise to the field of oculomics: the application of ophthalmic biomarkers to understand mechanisms and to detect and predict disease. The development of this field has been accelerated by three major advances: 1) the availability and widespread clinical adoption of high-resolution and non-invasive ophthalmic imaging (“hardware”); 2) the availability of large studies to interrogate associations (“big data”); 3) the development of novel analytical methods, including artificial intelligence (AI) (“software”). Oculomics offers an opportunity to enhance our understanding of the interplay between the eye and the body, while supporting development of innovative diagnostic, prognostic, and therapeutic tools. These advances have been further accelerated by developments in AI, coupled with large-scale linkage datasets linking ocular imaging data with systemic health data. Oculomics also enables the detection, screening, diagnosis, and monitoring of many systemic health conditions. Furthermore, oculomics with AI allows prediction of the risk of systemic diseases, enabling risk stratification, opening up new avenues for individualized risk prediction and prevention, and facilitating personalized medicine. In this review, we summarise current concepts and evidence in the field of oculomics, highlighting the progress that has been made, remaining challenges, and the opportunities for future research.
Towards Conversational AI for Disease Management
Khaled Saab
David Stutz
Kavita Kulkarni
Sara Mahdavi
Joelle Barral
James Manyika
Ryutaro Tanno
Adam Rodman
arXiv (2025)
While large language models (LLMs) have shown promise in diagnostic dialogue, their capabilities for effective management reasoning - including disease progression, therapeutic response, and safe medication prescription - remain under-explored. We advance the previously demonstrated diagnostic capabilities of the Articulate Medical Intelligence Explorer (AMIE) through a new LLM-based agentic system optimised for clinical management and dialogue, incorporating reasoning over the evolution of disease and multiple patient visit encounters, response to therapy, and professional competence in medication prescription. To ground its reasoning in authoritative clinical knowledge, AMIE leverages Gemini's long-context capabilities, combining in-context retrieval with structured reasoning to align its output with relevant and up-to-date clinical practice guidelines and drug formularies. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) study, AMIE was compared to 21 primary care physicians (PCPs) across 100 multi-visit case scenarios designed to reflect UK NICE Guidance and BMJ Best Practice guidelines. AMIE was non-inferior to PCPs in management reasoning as assessed by specialist physicians and scored better in both preciseness of treatments and investigations, and in its alignment with and grounding of management plans in clinical guidelines. To benchmark medication reasoning, we developed RxQA, a multiple-choice question benchmark derived from two national drug formularies (US, UK) and validated by board-certified pharmacists. While AMIE and PCPs both benefited from the ability to access external drug information, AMIE outperformed PCPs on higher difficulty questions. While further research would be needed before real-world translation, AMIE's strong performance across evaluations marks a significant step towards conversational AI as a tool in disease management.
A Call to Action: Advancing the Conversation Around Neurodivergent Education-Employment Transitions
Dannie Lynn Fountain
Vicki Baker
Kevin Danley
Closing the Gap (2025)
Neurodiversity is still largely stigmatized and excluded from DEIB frameworks and related organizational initiatives, despite the increased recognition regarding the benefits of neuroinclusion within the education and corporate spheres. We seek to address this knowledge-to-practice gap through the creation of the Neurodiversity Engagement Framework. By highlighting supports needed for neurodivergent individuals, and those who support them, the framework helps neurodivergent individuals navigate within and across higher education and industry contexts. Informed by an interdisciplinary review of literature from higher education, industry, and corporate leadership contexts, the Neurodiversity Engagement Framework brings to light prevailing challenges within practices and policies, serving as a guide for the creation of a more supportive foundation for neurodiverse individuals to thrive. In this manuscript, readers are encouraged to consider the myriad impacts that neurodiversity has on higher education and industry experiences and the ways that organizations can be more proactive in their support of this growing population. To conclude, we offer a roadmap for future research and practice to further elucidate ways academic and corporate leaders and policymakers can effectively support neurodivergent individuals.
Scalable Private Partition Selection via Adaptive Weighting
Justin Y. Chen
Forty-second International Conference on Machine Learning (2025)
In the differentially private partition selection problem (a.k.a. private set union, private key discovery), users hold subsets of items from an unbounded universe. The goal is to output as many items as possible from the union of the users' sets while maintaining user-level differential privacy. Solutions to this problem are a core building block for many privacy-preserving ML applications including vocabulary extraction in a private corpus, computing statistics over categorical data and learning embeddings over user-provided items.
We propose an algorithm for this problem, MaxAdaptiveDegree (MAD), which adaptively reroutes weight from items with weight far above the threshold needed for privacy to items with smaller weight, thereby increasing the probability that less frequent items are output. Our algorithm can be efficiently implemented in massively parallel computation systems allowing scalability to very large datasets. We prove that our algorithm stochastically dominates the standard parallel algorithm for this problem. We also develop a two-round version of our algorithm, MAD2R, where results of the computation in the first round are used to bias the weighting in the second round to maximize the number of items output. In experiments, our algorithms provide the best results across the board among parallel algorithms and scale to datasets with hundreds of billions of items, up to three orders of magnitude larger than those analyzed by prior sequential algorithms.
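The abstract does not spell out the "standard parallel algorithm" it compares against. A common non-adaptive baseline in the partition selection literature is the weighted Gaussian mechanism, sketched below on the assumption that this is the kind of baseline meant. The function name is hypothetical, and `sigma` and `threshold` are assumed to be calibrated elsewhere to the target privacy parameters, so this sketch by itself carries no privacy guarantee. MAD's adaptive rerouting of excess weight is not shown.

```python
import math
import random
from collections import defaultdict

def weighted_gaussian_partition_selection(user_sets, sigma, threshold, rng):
    """Non-adaptive baseline: each user spreads unit L2 weight evenly over its items."""
    weights = defaultdict(float)
    for items in user_sets:
        unique = set(items)
        if not unique:
            continue
        # Each item gets 1/sqrt(k), so one user's contribution has L2 norm 1.
        share = 1.0 / math.sqrt(len(unique))
        for item in unique:
            weights[item] += share
    released = set()
    for item in sorted(weights):  # fixed order keeps noise draws reproducible
        # Release an item only if its noisy weight clears the threshold.
        if weights[item] + rng.gauss(0.0, sigma) > threshold:
            released.add(item)
    return released

# Toy input; sigma=0 is used only to make the illustration deterministic.
users = [{"a", "b"}, {"a"}, {"a", "c"}]
released = weighted_gaussian_partition_selection(users, 0.0, 1.5, random.Random(0))
```

In this baseline a very popular item can accumulate weight far above the threshold, which is the excess that the abstract describes MAD as rerouting toward less frequent items.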
Toward Sensor-In-the-Loop LLM Agent: Benchmarks and Implications
Zhiwei Ren
Junbo Li
Minjia Zhang
Di Wang
Longfei Shangguan
SenSys 2025 - The 23rd ACM Conference on Embedded Networked Sensor Systems (2025)
This paper advocates for sensor-informed personal agents that can take advantage of sensor hints on wearables to enhance the personal agent's response. We demonstrate that such a sensor-in-the-loop design paradigm can be easily integrated into existing LLM agents by building a prototype named WellMax based on existing well-developed techniques such as structured prompt tuning and few-shot prompting. The head-to-head comparison with a non-sensor-informed agent across five use scenarios demonstrates that this sensor-in-the-loop design can better meet users' needs and improve their overall experience. The deep-dive into agents' replies and participants' feedback further reveals that sensor-in-the-loop agents not only provide more contextually relevant responses but also exhibit a greater understanding of user priorities and situational nuances. We further conduct two case studies to examine the potential pitfalls and distill key insights from this sensor-in-the-loop agent. We believe this work sets the stage for more intelligent, empathetic, and effective interactions in future AI-driven personal assistants.
We consider the Coalition Structure Learning (CSL) problem in multi-agent systems, motivated by the existence of coalitions in many real-world systems, e.g., trading platforms and auction systems. In this problem, there is a hidden coalition structure within a set of $n$ agents, which affects the behavior of the agents in games. Our goal is to actively design a sequence of games for the agents to play, such that observations in these games can be used to learn the hidden coalition structure. In particular, we consider the setting where in each round, we design and present a game together with a strategy profile to the agents, and receive a multiple-bit observation -- for each agent, we observe whether or not they would like to deviate from the specified strategy in this given game. Our contributions are three-fold: First, we show that we can learn the coalition structure in $O(\log n)$ rounds if we are allowed to choose any normal-form game in each round, matching the information-theoretical lower bound, and the result can be extended to congestion games. Second, in a more restricted setting where we can only choose a graphical game with degree limit $d$, we develop an algorithm to learn the coalition structure in $O(n/d+\log d)$ rounds. Third, when we can only learn the coalition structure through running second-price auctions with personalized reserve prices, we show that the coalition structure can be learned in $O(c\log n)$ rounds, where $c$ is the size of the largest coalition.
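One way to see why a logarithmic number of rounds is unavoidable is a standard counting argument, sketched here under two assumptions that may differ from the paper's own proof: each round returns exactly $n$ bits (one deviate-or-not bit per agent), and any partition of the $n$ agents is a possible coalition structure, so there are $B_n$ (the $n$-th Bell number) candidates. After $T$ rounds there are at most $2^{nT}$ distinguishable observation histories, which must cover all $B_n$ candidates.

```latex
\[
  2^{nT} \;\ge\; B_n
  \quad\Longrightarrow\quad
  T \;\ge\; \frac{\log_2 B_n}{n},
  \qquad
  \log_2 B_n = \Theta(n \log n)
  \quad\Longrightarrow\quad
  T = \Omega(\log n).
\]
```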
Browser fingerprinting is an online tracking technique that is being increasingly adopted for profiling and ad targeting purposes. While prior work has analyzed the prevalence and impact of browser fingerprinting on the Web, it has traditionally relied on large-scale automated crawls. Naturally, these cannot replicate real human interactions, e.g., solve CAPTCHAs, evade bot detectors, or operate behind login pages and paywalls. This prompts the question as to whether or not the fingerprinting ecosystem is appreciably different in real-world browsing sessions. In this paper, we begin to address this question by designing and conducting a user study aimed at collecting actual telemetry data from real browsing sessions of 30 users.
We find that almost half of the fingerprinting websites identified from real user browsing sessions are missed by equivalent automated crawls. This is mainly due to the inability of automated crawls to identify and visit authentication pages, being blocked by bot detectors, and/or failing to perform user interactions that specifically trigger browser fingerprinting scripts. We also find new fingerprinting vectors that are consistently present in fingerprinting scripts captured by real user browsing sessions yet missing from automated crawls. Finally, we assess the feasibility of collecting fingerprinting training data in a privacy-preserving way. We conclude that private models built on real user browsing sessions can detect browser fingerprinting more effectively than models trained on automated crawls alone, while simultaneously providing strong privacy guarantees to users.
Mind the GAP: Geometry Aware Passthrough Mitigates Cybersickness
Trishia Chemaly
Mohit Goyal
Sakar Khattar
Bjorn Vlaskamp
Aveek Purohit
Konstantine Tsotsos
2025
Virtual Reality headsets isolate users from the real world by restricting their perception to the virtual world. Video See-Through (VST) headsets address this by utilizing world-facing cameras to create Augmented Reality experiences. However, directly displaying camera feeds can cause visual discomfort and cybersickness due to the inaccurate perception of scale and exaggerated motion parallax. This paper presents initial findings on the potential of geometry aware passthrough systems to mitigate cybersickness through enhanced depth perception. We introduce a promising protocol for quantitatively measuring cybersickness experienced by users in VST headsets. Using this protocol, we conduct a user study to compare direct passthrough and geometry aware passthrough systems. To the best of our knowledge, our study is the first to reveal reduced nausea, disorientation, and total scores of cybersickness with geometry aware passthrough. It also uncovers several potential avenues to further mitigate visually-induced discomfort.
Mufu: Multilingual Fused Learning for Low-Resource Translation with LLM
Zheng Lim
Honglin Yu
Trevor Cohn
International Conference on Learning Representations (ICLR) 2025
Multilingual large language models (LLMs) are great translators, but this is largely limited to high-resource languages. For many LLMs, translating in and out of low-resource languages remains a challenging task. To maximize data efficiency in this low-resource setting, we introduce Mufu, which includes a selection of automatically generated multilingual candidates and an instruction to correct inaccurate translations in the prompt. Mufu prompts turn a translation task into a postediting one, and seek to harness the LLM's reasoning capability with auxiliary translation candidates, from which the model is required to assess the input quality, align the semantics cross-lingually, copy from relevant inputs and override instances that are incorrect. Our experiments on En-XX translations over the Flores-200 dataset show LLMs finetuned against Mufu-style prompts are robust to poor quality auxiliary translation candidates, achieving performance superior to the NLLB 1.3B distilled model in 64% of low- and very-low-resource language pairs. We then distill these models to reduce inference cost, while maintaining on average a 3.1 chrF improvement over the finetune-only baseline in low-resource translations.
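The chrF score cited above is a character n-gram F-score. The following is a simplified sketch of the metric, assuming the commonly used defaults of n-gram orders 1 to 6 and beta = 2 (recall weighted more heavily than precision). The function names are hypothetical, and reference implementations differ in details such as whitespace handling and smoothing, so this sketch will not reproduce published scores exactly.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF on a 0-100 scale: F-beta over averaged n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100.0 * (1 + beta**2) * p * r / (beta**2 * p + r)

identical = simple_chrf("the cat sat", "the cat sat")
partial = simple_chrf("the cat sat", "the cat sits")
```

Because it works on characters rather than words, chrF gives partial credit for near-miss word forms, which is one reason it is often reported for morphologically rich and low-resource languages.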
A Scalable Framework for Evaluating Health Language Models
Neil Mallinar
Tony Faranesh
Brent Winslow
Nova Hammerquist
Ben Graef
Cathy Speed
Mark Malhotra
Shwetak Patel
Xavi Prieto
Daniel McDuff
Ahmed Metwally
(2025)
Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information that encompasses lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety. Current evaluation practices for open-ended text responses heavily rely on human experts. This approach introduces human factors, is often cost-prohibitive and labor-intensive, and hinders scalability, especially in complex domains like healthcare where response assessment necessitates domain expertise and considers multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubric questions. Our approach is based on recent work in more general evaluation settings that contrasts a smaller set of complex evaluation targets with a larger set of more precise, granular targets answerable with simple boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield higher inter-rater agreement among expert and non-expert human evaluators, and in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods. This enhanced efficiency, particularly in automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.
Measuring software development can help drive impactful change. However, it’s a complex task, and getting started can be daunting as it involves understanding what you should measure, and determining what you can measure. This article provides a guide to selecting a framework that aligns with organizational measurement strategy.
From Few to Many: Self-Improving Many-Shot Reasoners Through Iterative Optimization and Generation
Han Zhou
Hootan Nakhost
Ke Jiang
International Conference on Learning Representations (ICLR) (2025)
Recent advances in long-context large language models (LLMs) have led to the emerging paradigm of many-shot in-context learning (ICL), where it is observed that scaling many more demonstrating examples beyond the conventional few-shot setup in the context can lead to performance benefits. However, despite its promise, it is unclear what aspects dominate the benefits and whether simply scaling to more examples is the most effective way of improving many-shot ICL. In this work, we first provide an analysis of the factors driving many-shot ICL, and we find that 1) many-shot performance can often still be attributed to a few disproportionately influential examples and 2) identifying such influential examples ("optimize") and using them as demonstrations to regenerate new examples ("generate") can lead to further improvements. Inspired by the findings, we propose BRIDGE, an algorithm that alternates between the optimize step with Bayesian optimization to discover the influential sets of examples and the generate step to reuse this set to expand the reasoning paths of the examples back to the many-shot regime automatically. On Gemini, Claude, and Mistral LLMs of different sizes, we show that BRIDGE leads to significant improvements across a diverse set of tasks, including symbolic reasoning, numerical reasoning, and code generation.
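The alternating structure described in this abstract can be illustrated with a toy control loop. This is only a sketch of the optimize/generate alternation, not the BRIDGE implementation: random search stands in for the Bayesian optimization the abstract names, and `score` and `generate` are hypothetical callables that a real system would back with LLM evaluation and LLM sampling.

```python
import random

def bridge_sketch(pool, score, generate, rounds=2, subset_size=2, trials=20, seed=0):
    """Alternate 'optimize' (pick an influential subset) and 'generate' (expand the pool)."""
    rng = random.Random(seed)
    pool = list(pool)
    best = pool[:subset_size]
    for _ in range(rounds):
        # Optimize: random search over subsets (a stand-in for Bayesian optimization).
        k = min(subset_size, len(pool))
        candidates = [rng.sample(pool, k) for _ in range(trials)]
        best = max(candidates, key=score)
        # Generate: use the selected subset as demonstrations to produce new examples.
        pool = pool + list(generate(best))
    return best, pool

# Toy stand-ins: examples are integers, "quality" is their sum,
# and "generation" yields one new example derived from the demonstrations.
best, final_pool = bridge_sketch(
    pool=[1, 2, 3, 4],
    score=sum,
    generate=lambda demos: [max(demos) + 1],
)
```

The point of the alternation is that each generate step starts from a small, high-scoring demonstration set rather than from the full, unfiltered pool.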
Silent Data Corruption by 10× Test Escapes Threatens Reliable Computing
Rama Govindaraju
Eric Liu
Subhasish Mitra
Mike Fuller
IEEE (2025) (to appear)
Summary:
"Silent Data Corruption by 10× Test Escapes Threatens Reliable Computing" highlights a critical issue: manufacturing defects, dubbed "test escapes," are evading current testing methods at an alarming rate, ten times higher than industry targets. These defects lead to Silent Data Corruption (SDC), where applications produce incorrect outputs without error indications, costing companies significantly in debugging, data recovery, and service disruptions. The paper proposes a three-pronged approach: quick diagnosis of defective chips directly from system-level behaviors, in-field detection using advanced testing and error detection techniques like CASP, and new, rigorous test experiments to validate these solutions and improve manufacturing testing practices.
GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities
Diganta Misra
Nizar Islah
Brice Rauby
Zihan Wang
Justine Gehring
Antonio Orvieto
Muawiz Chaudhary
Eilif Muller
Irina Rish
Samira Ebrahimi Kahou
Massimo Caccia
2025
The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon 2.0, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon 2.0 rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task, with enterprise models achieving baseline success rates in the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon 2.0 enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods.
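To make "execution-based evaluation" concrete, here is a minimal sketch of the idea: a candidate solution counts as correct only if accompanying unit tests run without error, and the installed library version can be checked before running. Both helper names are hypothetical and this is not the benchmark's actual harness. A real harness would run each problem in an isolated environment with the pinned library version installed, and would sandbox the untrusted code rather than calling `exec` directly.

```python
from importlib import metadata

def installed_version_matches(package, version_prefix):
    """True if the package is installed and its version starts with the given prefix."""
    try:
        return metadata.version(package).startswith(version_prefix)
    except metadata.PackageNotFoundError:
        return False

def passes_unit_test(candidate_code, test_code):
    """True if test_code runs without raising after candidate_code is defined."""
    namespace = {}
    try:
        exec(candidate_code, namespace)  # define the candidate solution
        exec(test_code, namespace)       # run assertions against it
    except Exception:
        return False
    return True

# Toy problem: the test distinguishes a correct completion from a wrong one.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
test = "assert add(2, 3) == 5\n"
good_result = passes_unit_test(good, test)
bad_result = passes_unit_test(bad, test)
```

Running the tests, rather than comparing generated text against a reference solution, is what lets such a benchmark accept any completion that behaves correctly under the targeted library version.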