Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 10331 publications
    Scaling Laws for Downstream Task Performance in Machine Translation
    Natalia Ponomareva
    Hussein Hazimeh
    Sanmi Koyejo
    International Conference on Learning Representations (ICLR) (2025) (to appear)
    Preview abstract Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in which LLMs are pretrained on an unsupervised dataset and then finetuned on a downstream task, we often also care about the downstream performance. In this work, we study the scaling behavior in a transfer learning setting, where LLMs are finetuned for machine translation tasks. Specifically, we investigate how the choice of the \emph{pretraining} data and its size affect downstream performance (translation quality) as judged by: downstream cross-entropy and translation quality metrics such as BLEU and COMET scores. Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and translation quality scores improve monotonically with more pretraining data. In such cases, we show that it is possible to predict the downstream translation quality metrics with good accuracy using a log-law. However, there are cases where moderate misalignment causes the downstream translation scores to fluctuate or get worse with more pretraining, whereas downstream cross-entropy monotonically improves. By analyzing these, we provide new practical insights for choosing appropriate pretraining data. View details
    Oculomics: Current Concepts and Evidence
    Zhuoting Zhu
    Yueye Wang
    Ziyi Qi
    Wenyi Hu
    Xiayin Zhang
    Siegfried Wagner
    Yujie Wang
    An Ran Ran
    Joshua Ong
    Ethan Waisberg
    Mouayad Masalkhi
    Alex Suh
    Yih Chung Tham
    Carol Y. Cheung
    Xiaohong Yang
    Honghua Yu
    Zongyuan Ge
    Wei Wang
    Bin Sheng
    Andrew G. Lee
    Alastair Denniston
    Peter van Wijngaarden
    Pearse Keane
    Ching-Yu Cheng
    Mingguang He
    Tien Yin Wong
    Progress in Retinal and Eye Research (2025)
    Preview abstract The eye provides novel insights into general health, as well as pathogenesis and development of systemic diseases. In the past decade, growing evidence has demonstrated that the eye's structure and function mirror multiple systemic health conditions, especially in cardiovascular diseases, neurodegenerative disorders, and kidney impairments. This has given rise to the field of oculomics- the application of ophthalmic biomarkers to understand mechanisms, detect and predict disease. The development of this field has been accelerated by three major advances: 1) the availability and widespread clinical adoption of high-resolution and non-invasive ophthalmic imaging (“hardware”); 2) the availability of large studies to interrogate associations (“big data”); 3) the development of novel analytical methods, including artificial intelligence (AI) (“software”). Oculomics offers an opportunity to enhance our understanding of the interplay between the eye and the body, while supporting development of innovative diagnostic, prognostic, and therapeutic tools. These advances have been further accelerated by developments in AI, coupled with large-scale linkage datasets linking ocular imaging data with systemic health data. Oculomics also enables the detection, screening, diagnosis, and monitoring of many systemic health conditions. Furthermore, oculomics with AI allows prediction of the risk of systemic diseases, enabling risk stratification, opening up new avenues for prevention or individualized risk prediction and prevention, facilitating personalized medicine. In this review, we summarise current concepts and evidence in the field of oculomics, highlighting the progress that has been made, remaining challenges, and the opportunities for future research. View details
    Triply efficient shadow tomography
    Robbie King
    David Gosset
    PRX Quantum, 6 (2025), pp. 010336
    Preview abstract Given copies of a quantum state $\rho$, a shadow tomography protocol aims to learn all expectation values from a fixed set of observables, to within a given precision $\epsilon$. We say that a shadow tomography protocol is \textit{triply efficient} if it is sample- and time-efficient, and only employs measurements that entangle a constant number of copies of $\rho$ at a time. The classical shadows protocol based on random single-copy measurements is triply efficient for the set of local Pauli observables. This and other protocols based on random single-copy Clifford measurements can be understood as arising from fractional colorings of a graph $G$ that encodes the commutation structure of the set of observables. Here we describe a framework for two-copy shadow tomography that uses an initial round of Bell measurements to reduce to a fractional coloring problem in an induced subgraph of $G$ with bounded clique number. This coloring problem can be addressed using techniques from graph theory known as \textit{chi-boundedness}. Using this framework we give the first triply efficient shadow tomography scheme for the set of local fermionic observables, which arise in a broad class of interacting fermionic systems in physics and chemistry. We also give a triply efficient scheme for the set of all $n$-qubit Pauli observables. Our protocols for these tasks use two-copy measurements, which is necessary: sample-efficient schemes are provably impossible using only single-copy measurements. Finally, we give a shadow tomography protocol that compresses an $n$-qubit quantum state into a $\poly(n)$-sized classical representation, from which one can extract the expected value of any of the $4^n$ Pauli observables in $\poly(n)$ time, up to a small constant error. View details
    Sufficient Context: A New Lens on Retrieval Augmented Generation Systems
    Hailey Joren
    Jianyi Zhang
    Chun-Sung Ferng
    Ankur Taly
    International Conference on Learning Representations (ICLR) (2025)
    Preview abstract Augmenting LLMs with context leads to improved performance across many applications. Despite much research on Retrieval Augmented Generation (RAG) systems, an open question is whether errors arise because LLMs fail to utilize the context from retrieval or the context itself is insufficient to answer the query. To shed light on this, we develop a new notion of sufficient context, along with a method to classify instances that have enough information to answer the query. We then use sufficient context to analyze several models and datasets. By stratifying errors based on context sufficiency, we find that larger models with higher baseline performance (Gemini 1.5 Pro, GPT 4o, Claude 3.5) excel at answering queries when the context is sufficient, but often output incorrect answers instead of abstaining when the context is not. On the other hand, smaller models with lower baseline performance (Llama 3.1, Mistral 3, Gemma 2) hallucinate or abstain often, even with sufficient context. We further categorize cases when the context is useful, and improves accuracy, even though it does not fully answer the query and the model errs without the context. Building on our findings, we explore ways to reduce hallucinations in RAG systems, including a new selective generation method that leverages sufficient context information for guided abstention. Our method improves the fraction of correct answers among times where the model responds by 2--10% for Gemini, GPT, and Gemma. View details
    PreFix: Optimizing the Performance of Heap-Intensive Applications
    Chaitanya Mamatha Ananda
    Rajiv Gupta
    Han Shen
    CGO 2025: International Symposium on Code Generation and Optimization, Las Vegas, NV, USA (to appear)
    Preview abstract Analyses of heap-intensive applications show that a small fraction of heap objects account for the majority of heap accesses and data cache misses. Prior works like HDS and HALO have shown that allocating hot objects in separate memory regions can improve spatial locality leading to better application performance. However, these techniques are constrained in two primary ways, limiting their gains. First, these techniques have Imperfect Separation, polluting the hot memory region with several cold objects. Second, reordering of objects across allocations is not possible as the original object allocation order is preserved. This paper presents a novel technique that achieves near perfect separation of hot objects via a new context mechanism that efficiently identifies hot objects with high precision. This technique, named PreFix, is based upon Preallocating memory for a Fixed small number of hot objects. The program, guided by profiles, is instrumented to compute context information derived from dynamic object identifiers, that precisely identifies hot object allocations that are then placed at predetermined locations in the preallocated memory. The preallocated memory region for hot objects provides the flexibility to reorder objects across allocations and allows colocation of objects that are part of a hot data stream (HDS), improving spatial locality. The runtime overhead of identifying hot objects is not significant as this optimization is only focused on a small number of static hot allocation sites and dynamic hot objects. While there is an increase in the program’s memory foot-print, it is manageable and can be controlled by limiting the size of the preallocated memory. In addition, PreFix incorporates an object recycling optimization that reuses the same preallocated space to store different objects whose lifetimes are not expected to overlap. Our experiments with 13 heap-intensive applications yields reductions in execution times ranging from 2.77% to 74%. On average PreFix reduces execution time by 21.7% compared to 7.3% by HDS and 14% by HALO. This is due to PreFix’s precision in hot object identification, hot object colocation, and low runtime overhead. View details
    Preview abstract Cloud application development faces the inherent challenge of balancing rapid innovation with high availability. This blog post details how Google Workspace's Site Reliability Engineering team addresses this conflict by implementing vertical partitioning of serving stacks. By isolating application servers and storage into distinct partitions, the "blast radius" of code changes and updates is significantly reduced, minimizing the risk of global outages. This approach, which complements canary deployments, enhances service availability, provides flexibility for experimentation, and facilitates data localization. While challenges such as data model complexities and inter-service partition misalignment exist, the benefits of improved reliability and controlled deployments make partitioning a crucial strategy for maintaining robust cloud applications View details
    Preview abstract While large language models (LLMs) have shown promise in diagnostic dialogue, their capabilities for effective management reasoning - including disease progression, therapeutic response, and safe medication prescription - remain under-explored. We advance the previously demonstrated diagnostic capabilities of the Articulate Medical Intelligence Explorer (AMIE) through a new LLM-based agentic system optimised for clinical management and dialogue, incorporating reasoning over the evolution of disease and multiple patient visit encounters, response to therapy, and professional competence in medication prescription. To ground its reasoning in authoritative clinical knowledge, AMIE leverages Gemini's long-context capabilities, combining in-context retrieval with structured reasoning to align its output with relevant and up-to-date clinical practice guidelines and drug formularies. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) study, AMIE was compared to 21 primary care physicians (PCPs) across 100 multi-visit case scenarios designed to reflect UK NICE Guidance and BMJ Best Practice guidelines. AMIE was non-inferior to PCPs in management reasoning as assessed by specialist physicians and scored better in both preciseness of treatments and investigations, and in its alignment with and grounding of management plans in clinical guidelines. To benchmark medication reasoning, we developed RxQA, a multiple-choice question benchmark derived from two national drug formularies (US, UK) and validated by board-certified pharmacists. While AMIE and PCPs both benefited from the ability to access external drug information, AMIE outperformed PCPs on higher difficulty questions. While further research would be needed before real-world translation, AMIE's strong performance across evaluations marks a significant step towards conversational AI as a tool in disease management. View details
    Shadow Hamiltonian Simulation
    Rolando Somma
    Robbie King
    Tom O'Brien
    Nature Communications, 16 (2025), pp. 2690
    Preview abstract Simulating quantum dynamics is one of the most important applications of quantum computers. Traditional approaches for quantum simulation involve preparing the full evolved state of the system and then measuring some physical quantity. Here, we present a different and novel approach to quantum simulation that uses a compressed quantum state that we call the "shadow state". The amplitudes of this shadow state are proportional to the time-dependent expectations of a specific set of operators of interest, and it evolves according to its own Schrödinger equation. This evolution can be simulated on a quantum computer efficiently under broad conditions. Applications of this approach to quantum simulation problems include simulating the dynamics of exponentially large systems of free fermions or free bosons, the latter example recovering a recent algorithm for simulating exponentially many classical harmonic oscillators. These simulations are hard for classical methods and also for traditional quantum approaches, as preparing the full states would require exponential resources. Shadow Hamiltonian simulation can also be extended to simulate expectations of more complex operators such as two-time correlators or Green's functions, and to study the evolution of operators themselves in the Heisenberg picture. View details
    Capturing Real-World Habitual Sleep Patterns with a Novel User-centric Algorithm to Pre-Process Fitbit Data in the All of Us Research Program: Retrospective observational longitudinal study
    Hiral Master
    Jeffrey Annis
    Karla Gleichauf
    Lide Han
    Peyton Coleman
    Kelsie Full
    Neil Zheng
    Doug Ruderfer
    Logan Schneider
    Evan Brittain
    Journal of Medical Internet Research (2025) (to appear)
    Preview abstract Background: Commercial wearables like Fitbits quantify sleep metrics using fixed calendar times as the default measurement periods, which may not adequately account for individual variations in sleep patterns. To address this, experts in sleep medicine and wearables developed a user-centric algorithm that more accurately reflects actual sleep behaviors, aiming to improve wearable-derived sleep metrics. Objective: The study aimed to describe the development of the new (user-centric) algorithm, and how it compares with the default (calendar-relative), and offers best practices for analyzing All of Us Fitbit sleep data on a cloud platform. Methods: The default and new algorithms was implemented to pre-process and then compute sleep metrics related to schedule, duration, and disturbances using high-resolution Fitbit sleep data from 8,563 participants (median age 58.1 years, 72% female) in the All of Us Research Program (v7 Controlled Tier). Variation in typical sleep patterns was computed by taking the differences in the mean number of primary sleep logs classified by each algorithm. Linear mixed-effects models were used to compare differences in sleep metrics across quartiles of variation in typical sleep patterns. Results: Out of 8,452,630 total sleep logs over a median of 4.2 years of Fitbit monitoring, 401,777 (5%) non-primary sleep logs identified by default algorithm were reclassified to primary sleep by the user-centric algorithm. Variation in typical sleep patterns ranged from -0.08 to 1. Among participants with the most variation in typical sleep patterns, the new algorithm identified more total sleep time (by 17.6 minutes; P<0.001), more wake after sleep onset (by 13.9 minutes; P<0.001), and lower sleep efficiency (by 2.0%; P<0.001), on average. There were only modest differences in sleep stage metrics between the two algorithms. Conclusions: The user-centric algorithm captures the natural variability in sleep schedules, offering an alternative way to pre-process and evaluate sleep metrics related to schedule, duration, and disturbances. R package is publicly available to facilitate the implementation of this algorithm for clinical and translational use. View details
    Preview abstract Measuring productivity is equivalent to building a model. All models are wrong, but some are useful. Productivity models are often “worryingly selective” (wrong because of omissions). Worrying selectivity can be combated by taking a holistic approach that includes multiple measurements of multiple outcomes. Productivity models should include multiple outcomes, metrics, and methods. View details
    Preview abstract Despite the surge in popularity of virtual reality (VR), mobile phones remain the primary medium for accessing digital content, offering both privacy and portability. This short paper presents Beyond the Phone, a novel framework that enhances mobile phones in VR with context-aware controls and spatial augmentation. We first establish a comprehensive design space through brainstorming and iterative discussions with VR experts. We then develop a proof-of-concept system that analyzes UI layouts to offer context-aware controls and spatial augmentation, targeting six key application areas within our design space. Finally, we demonstrate that our system can effectively adapt to a broad spectrum of applications at runtime, and discuss future directions with reviews with seven experts. View details
    Zero-Shot Image Moderation in Google Ads with LLM-Assisted Textual Descriptions and Cross-modal Co-embeddings
    Jimin Li
    Eric Xiao
    Katie Warren
    Enming Luo
    Krishna Viswanathan
    Ariel Fuxman
    Bill Li
    Yintao Liu
    (2025)
    Preview abstract We present a scalable and agile approach for ads image content moderation at Google, addressing the challenges of moderating massive volumes of ads with diverse content and evolving policies. The proposed method utilizes human-curated textual descriptions and cross-modal text-image co-embeddings to enable zero-shot classification of policy violating ads images, bypassing the need for extensive supervised training data and human labeling. By leveraging large language models (LLMs) and user expertise, the system generates and refines a comprehensive set of textual descriptions representing policy guidelines. During inference, co-embedding similarity between incoming images and the textual descriptions serves as a reliable signal for policy violation detection, enabling efficient and adaptable ads content moderation. Evaluation results demonstrate the efficacy of this framework in significantly boosting the detection of policy violating content. View details
    Validation of a Deep Learning Model for Diabetic Retinopathy on Patients with Young-Onset Diabetes
    Tony Tan-Torres
    Pradeep Praveen
    Divleen Jeji
    Arthur Brant
    Xiang Yin
    Lu Yang
    Tayyeba Ali
    Ilana Traynis
    Dushyantsinh Jadeja
    Rajroshan Sawhney
    Sunny Virmani
    Pradeep Venkatesh
    Nikhil Tandon
    Ophthalmology and Therapy (2025)
    Preview abstract Introduction While many deep learning systems (DLSs) for diabetic retinopathy (DR) have been developed and validated on cohorts with an average age of 50s or older, fewer studies have examined younger individuals. This study aimed to understand DLS performance for younger individuals, who tend to display anatomic differences, such as prominent retinal sheen. This sheen can be mistaken for exudates or cotton wool spots, and potentially confound DLSs. Methods This was a prospective cross-sectional cohort study in a “Diabetes of young” clinic in India, enrolling 321 individuals between ages 18 and 45 (98.8% with type 1 diabetes). Participants had fundus photographs taken and the photos were adjudicated by experienced graders to obtain reference DR grades. We defined a younger cohort (age 18–25) and an older cohort (age 26–45) and examined differences in DLS performance between the two cohorts. The main outcome measures were sensitivity and specificity for DR. Results Eye-level sensitivity for moderate-or-worse DR was 97.6% [95% confidence interval (CI) 91.2, 98.2] for the younger cohort and 94.0% [88.8, 98.1] for the older cohort (p = 0.418 for difference). The specificity for moderate-or-worse DR significantly differed between the younger and older cohorts, 97.9% [95.9, 99.3] and 92.1% [87.6, 96.0], respectively (p = 0.008). Similar trends were observed for diabetic macular edema (DME); sensitivity was 79.0% [57.9, 93.6] for the younger cohort and 77.5% [60.8, 90.6] for the older cohort (p = 0.893), whereas specificity was 97.0% [94.5, 99.0] and 92.0% [88.2, 95.5] (p = 0.018). Retinal sheen presence (94% of images) was associated with DME presence (p < 0.0001). Image review suggested that sheen presence confounded reference DME status, increasing noise in the labels and depressing measured sensitivity. The gradability rate for both DR and DME was near-perfect (99% for both). Conclusion DLS-based DR screening performed well in younger individuals aged 18–25, with comparable sensitivity and higher specificity compared to individuals aged 26–45. Sheen presence in this cohort made identification of DME difficult for graders and depressed measured DLS sensitivity; additional studies incorporating optical coherence tomography may improve accuracy of measuring DLS DME sensitivity. View details
    Heterogenous graph neural networks for species distribution modeling
    Christine Kaeser-Chen
    Keith Anderson
    Michelangelo Conserva
    Elise Kleeman
    Maxim Neumann
    Matt Overlan
    Millie Chapman
    Drew Purves
    arxiv (2025)
    Preview abstract Species distribution models (SDMs) are necessary for measuring and predicting occurrences and habitat suitability of species and their relationship with environmental factors. We introduce a novel presence-only SDM with graph neural networks (GNN). In our model, species and locations are treated as two distinct node sets, and the learning task is predicting detection records as the edges that connect locations to species. Using GNN for SDM allows us to model fine-grained interactions between species and the environment. We evaluate the potential of this methodology on the six-region dataset compiled by National Center for Ecological Analysis and Synthesis (NCEAS) for benchmarking SDMs. For each of the regions, the heterogeneous GNN model is comparable to or outperforms previously-benchmarked single-species SDMs as well as a feed-forward neural network baseline model. View details
    Software development is a team sport
    Jie Chen
    Alison Chang
    Rayven Plaza
    Marie Huber
    Claire Taylor
    IEEE Software (2025)
    Preview abstract In this article, we describe our human-centered research focused on understanding the role of collaboration and teamwork in productive software development. We describe creation of a logs-based metric to identify collaboration through observable events and a survey-based multi-item scale to assess team functioning. View details