Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 10132 publications
    Preview abstract Personalized recommendation requires understanding both the candidate items and user preferences. Traditional collaborative filtering approaches rely on embedding users and items in the same representation space while recent efforts formulate the problem into sequential user activity modeling and future activity prediction tasks. Some of the most recent efforts leverage autoregressive large language models to directly generate the recommendation. This work proposes CALRec, a sequential recommendation framework aligning the generative task based on PaLM-2 LLM with contrastive learning tasks for user/item understanding. To leverage the strong generalization capabilities of the state-of-the-art pretrained LLMs, our input consists of pure texts following differentiable text templates for user inputs and item inputs. We propose novel ways of combining generative loss and contrastive losses in multi-category joint continuous pretraining, followed by domain-specific finetuning. During training, the LLM backbone trains in a two-tower fashion to comprehend users’ consecutive behaviors and descriptions of individual items. Our model outperforms many state-of-the-art baselines significantly especially in ranking tasks. Our systematic ablation study reveals that (i) multi-category pretraining and domain-adaptation finetuning are both important and deliver better performance when combined, and (ii) contrastive alignment further improves the quality among many categories of the Amazon review dataset. View details
    Optimizing quantum gates towards the scale of logical qubits
    Alexandre Bourassa
    Andrew Dunsworth
    Will Livingston
    Vlad Sivak
    Trond Andersen
    Yaxing Zhang
    Desmond Chik
    Jimmy Chen
    Charles Neill
    Alejo Grajales Dau
    Anthony Megrant
    Alexander Korotkov
    Vadim Smelyanskiy
    Yu Chen
    Nature Communications, 15 (2024), pp. 2442
    Preview abstract A foundational assumption of quantum error correction theory is that quantum gates can be scaled to large processors without exceeding the error-threshold for fault tolerance. Two major challenges that could become fundamental roadblocks are manufacturing high-performance quantum hardware and engineering a control system that can reach its performance limits. The control challenge of scaling quantum gates from small to large processors without degrading performance often maps to non-convex, high-constraint, and time-dynamic control optimization over an exponentially expanding configuration space. Here we report on a control optimization strategy that can scalably overcome the complexity of such problems. We demonstrate it by choreographing the frequency trajectories of 68 frequency-tunable superconducting qubits to execute single- and two-qubit gates while mitigating computational errors. When combined with a comprehensive model of physical errors across our processor, the strategy suppresses physical error rates by ~3.7× compared with the case of no optimization. Furthermore, it is projected to achieve a similar performance advantage on a distance-23 surface code logical qubit with 1057 physical qubits. Our control optimization strategy solves a generic scaling challenge in a way that can be adapted to a variety of quantum operations, algorithms, and computing architectures. View details
    Preview abstract As AI systems quickly improve in both breadth and depth of performance, they lend themselves to creating increasingly powerful and realistic agents, including the possibility of agents modeled on specific people. We anticipate that within our lifetimes it may become common practice for people to create a custom AI agent to interact with loved ones and/or the broader world after death. We call these generative ghosts, since such agents will be capable of generating novel content rather than merely parroting content produced by their creator while living. In this paper, we first discuss the design space of potential implementations of generative ghosts. We then discuss the practical and ethical implications of generative ghosts, including potential positive and negative impacts on individuals and society. Based on these considerations, we lay out a research agenda for the AI and HCI research communities to empower people to create and interact with AI afterlives in a safe and beneficial manner. View details
    Preview abstract Learning from Label Proportions (LLP) is a learning problem where only aggregate level labels are available for groups of instances, called bags, during training, and the aim is to get the best performance at the instance-level on the test data. This setting arises in domains like advertising and medicine due to privacy considerations. We propose a novel algorithmic framework for this problem that iteratively performs two main steps. For the first step (Pseudo Labeling) in every iteration, we define a Gibbs distribution over binary instance labels that incorporates a) covariate information through the constraint that instances with similar covariates should have similar labels and b) the bag level aggregated label. We then use Belief Propagation (BP) to marginalize the Gibbs distribution to obtain pseudo labels. In the second step (Embedding Refinement), we use the pseudo labels to provide supervision for a learner that yields a better embedding. Further, we iterate on the two steps again by using the second step's embeddings as new covariates for the next iteration. In the final iteration, a classifier is trained using the pseudo labels. Our algorithm displays strong gains against several SOTA baselines for the LLP Binary Classification problem on various dataset types - Small Tabular, Large Tabular and Images. We achieve these improvements with minimal computational overhead above standard supervised learning due to Belief Propagation, for large bag sizes, even for a million samples. View details
    Specifying BGP using TLA+
    Aman Shaikh
    (2024)
    Preview abstract This presentation is about the TLA+ specification we have written for BGP, the routing protocol underpinning the Internet. The specification also serves as a crucial first-step towards the use of TLA+ for verification of network designs. View details
    Preview abstract In the present computerized period, information driven navigation is essential for the progress of cooperative work areas. This paper gives an extensive examination of how information designing, distributed storage, and business insight synergistically engage groups. We look at the basic standards of information designing, zeroing in on the plan, development, and the management of adaptable information pipelines. The job of distributed storage is investigated, featuring its ability to give adaptable, secure, and open information arrangements. Besides, we dive into business knowledge instruments and their capacity to change crude information into significant experiences. Through contextual analyses and exact information, we delineate the groundbreaking effect of these advances in group efficiency, coordinated effort, and dynamic cycles. This examination highlights the significance of incorporating hearty information designing works on, utilizing distributed storage arrangements, and utilizing complex business knowledge apparatuses to establish information engaged cooperative conditions. View details
    API Governance at Scale
    Mak Ahmad
    JJ Geewax
    David R Karger
    Kwan-Liu Ma
    ICSE 2024 Software Engineering in Practice (2024)
    Preview abstract API Governance, the process of applying standardized sets of policies and guardrails to the design and development of APIs, has only grown in importance and prominence given the continued growth in APIs being produced. In this paper, we present an Action Research style approach to investigate and understand the utility of a multi-faceted API Governance process being adopted inside Google. We first reflect on past research around API Governance, and then introduce three new components, 1. API Improvement Proposals (AIPs) the documented source of truth for API design rules, 2. API Linter, an automated analysis tool which checks for adherence to / violations of AIPs, and 3. API Readability, a program to educate and certify API design experts. These three components are designed to build upon pre-existing processes to scale and improve API design. Through a mixed-methods research strategy, containing both a survey and a series of interviews, we evaluate the utility of these approaches in supporting API Producers. Our research shows that API Producers have positive sentiment towards API Governance, validating the general direction of the program. Specifically, our study participants highlighted the positive impact of API Governance on the quality of the APIs they produced, via consistency in both the outcome and approach. This paper also discusses future research opportunities to enhance API Governance, specifically with regards to newer API Producers, who reported worse sentiment towards the program than their more experienced peers. View details
    ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters
    Shiwei Liu
    Guanchen Tao
    Yifei Zou
    Derek Chow
    Zichen Fan
    Kauna Lei
    Bangfei Pan
    Dennis Sylvester
    Mehdi Saligane
    Arxiv (2024)
    Preview abstract The self-attention mechanism sets transformer-based large language model (LLM) apart from the convolutional and recurrent neural networks. Despite the performance improvement, achieving real-time LLM inference on silicon is challenging due to the extensively used Softmax in self-attention. Apart from the non-linearity, the low arithmetic intensity greatly reduces the processing parallelism, which becomes the bottleneck especially when dealing with a longer context. To address this challenge, we propose Constant Softmax (ConSmax), a software-hardware co-design as an efficient Softmax alternative. ConSmax employs differentiable normalization parameters to remove the maximum searching and denominator summation in Softmax. It allows for massive parallelization while performing the critical tasks of Softmax. In addition, a scalable ConSmax hardware utilizing a bitwidth-split look-up table (LUT) can produce lossless non-linear operation and support mix-precision computing. It further facilitates efficient LLM inference. Experimental results show that ConSmax achieves a minuscule power consumption of 0.2 mW and area of 0.0008 mm^2 at 1250-MHz working frequency and 16-nm CMOS technology. Compared to state-of-the-art Softmax hardware, ConSmax results in 3.35x power and 2.75x area savings with a comparable accuracy on a GPT-2 model and the WikiText103 dataset. View details
    Preview abstract Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and various bias factors in the prompt, including but not limited to the formatting, the choice verbalizers, and the ICL examples. To address this problem that results in unexpected performance degradation, calibration methods have been developed to mitigate the effects of these biases while recovering LLM performance. In this work, we first conduct a systematic analysis of the existing calibration methods, where we both provide a unified view and reveal the failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches, and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional costs. In the few-shot setup, we further extend BC to allow it to learn the contextual bias from labeled data. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks. View details
    Creating an Empirical Dermatology Dataset Through Crowdsourcing With Web Search Advertisements
    Abbi Ward
    Jimmy Li
    Julie Wang
    Sriram Lakshminarasimhan
    Ashley Carrick
    Jay Hartford
    Pradeep Kumar S
    Sunny Virmani
    Renee Wong
    Margaret Ann Smith
    Dawn Siegel
    Steven Lin
    Justin Ko
    JAMA Network Open (2024)
    Preview abstract Importance: Health datasets from clinical sources do not reflect the breadth and diversity of disease, impacting research, medical education, and artificial intelligence tool development. Assessments of novel crowdsourcing methods to create health datasets are needed. Objective: To evaluate if web search advertisements (ads) are effective at creating a diverse and representative dermatology image dataset. Design, Setting, and Participants: This prospective observational survey study, conducted from March to November 2023, used Google Search ads to invite internet users in the US to contribute images of dermatology conditions with demographic and symptom information to the Skin Condition Image Network (SCIN) open access dataset. Ads were displayed against dermatology-related search queries on mobile devices, inviting contributions from adults after a digital informed consent process. Contributions were filtered for image safety and measures were taken to protect privacy. Data analysis occurred January to February 2024. Exposure: Dermatologist condition labels as well as estimated Fitzpatrick Skin Type (eFST) and estimated Monk Skin Tone (eMST) labels. Main Outcomes and Measures: The primary metrics of interest were the number, quality, demographic diversity, and distribution of clinical conditions in the crowdsourced contributions. Spearman rank order correlation was used for all correlation analyses, and the χ2 test was used to analyze differences between SCIN contributor demographics and the US census. Results: In total, 5749 submissions were received, with a median of 22 (14-30) per day. Of these, 5631 (97.9%) were genuine images of dermatological conditions. Among contributors with self-reported demographic information, female contributors (1732 of 2596 contributors [66.7%]) and younger contributors (1329 of 2556 contributors [52.0%] aged <40 years) had a higher representation in the dataset compared with the US population. Of 2614 contributors who reported race and ethnicity, 852 (32.6%) reported a racial or ethnic identity other than White. Dermatologist confidence in assigning a differential diagnosis increased with the number of self-reported demographic and skin-condition–related variables (Spearman R = 0.1537; P < .001). Of 4019 contributions reporting duration since onset, 2170 (54.0%) reported onset within less than 7 days of submission. Of the 2835 contributions that could be assigned a dermatological differential diagnosis, 2523 (89.0%) were allergic, infectious, or inflammatory conditions. eFST and eMST distributions reflected the geographical origin of the dataset. Conclusions and Relevance: The findings of this survey study suggest that search ads are effective at crowdsourcing dermatology images and could therefore be a useful method to create health datasets. The SCIN dataset bridges important gaps in the availability of images of common, short-duration skin conditions. View details
    Health equity assessment of machine learning performance (HEAL): a framework and dermatology AI model case study
    Terry Spitz
    Malcolm Chelliah
    Heather Cole-Lewis
    Stephanie Farquhar
    Qinghan Xue
    Jenna Lester
    Cían Hughes
    Patricia Strachan
    Fraser Tan
    Peggy Bui
    Craig Mermel
    Lily Peng
    Sunny Virmani
    Ivor Horn
    Cameron Chen
    The Lancet eClinicalMedicine (2024)
    Preview abstract Background Artificial intelligence (AI) has repeatedly been shown to encode historical inequities in healthcare. We aimed to develop a framework to quantitatively assess the performance equity of health AI technologies and to illustrate its utility via a case study. Methods Here, we propose a methodology to assess whether health AI technologies prioritise performance for patient populations experiencing worse outcomes, that is complementary to existing fairness metrics. We developed the Health Equity Assessment of machine Learning performance (HEAL) framework designed to quantitatively assess the performance equity of health AI technologies via a four-step interdisciplinary process to understand and quantify domain-specific criteria, and the resulting HEAL metric. As an illustrative case study (analysis conducted between October 2022 and January 2023), we applied the HEAL framework to a dermatology AI model. A set of 5420 teledermatology cases (store-and-forward cases from patients of 20 years or older, submitted from primary care providers in the USA and skin cancer clinics in Australia), enriched for diversity in age, sex and race/ethnicity, was used to retrospectively evaluate the AI model's HEAL metric, defined as the likelihood that the AI model performs better for subpopulations with worse average health outcomes as compared to others. The likelihood that AI performance was anticorrelated to pre-existing health outcomes was estimated using bootstrap methods as the probability that the negated Spearman's rank correlation coefficient (i.e., “R”) was greater than zero. Positive values of R suggest that subpopulations with poorer health outcomes have better AI model performance. Thus, the HEAL metric, defined as p (R >0), measures how likely the AI technology is to prioritise performance for subpopulations with worse average health outcomes as compared to others (presented as a percentage below). Health outcomes were quantified as disability-adjusted life years (DALYs) when grouping by sex and age, and years of life lost (YLLs) when grouping by race/ethnicity. AI performance was measured as top-3 agreement with the reference diagnosis from a panel of 3 dermatologists per case. Findings Across all dermatologic conditions, the HEAL metric was 80.5% for prioritizing AI performance of racial/ethnic subpopulations based on YLLs, and 92.1% and 0.0% respectively for prioritizing AI performance of sex and age subpopulations based on DALYs. Certain dermatologic conditions were significantly associated with greater AI model performance compared to a reference category of less common conditions. For skin cancer conditions, the HEAL metric was 73.8% for prioritizing AI performance of age subpopulations based on DALYs. Interpretation Analysis using the proposed HEAL framework showed that the dermatology AI model prioritised performance for race/ethnicity, sex (all conditions) and age (cancer conditions) subpopulations with respect to pre-existing health disparities. More work is needed to investigate ways of promoting equitable AI performance across age for non-cancer conditions and to better understand how AI models can contribute towards improving equity in health outcomes. View details
    Preview abstract In-Context Learning (ICL) is an emergent capability of Large Language Models (LLMs). Only a few demonstrations enable LLMs to be used as blackbox for new tasks. Previous studies have shown that using LLMs' outputs as labels is effective in training models to select demonstrations. Such a label is expected to estimate utility of a demonstration in ICL; however, it has not been well understood how different labeling strategies affect results on target tasks. This paper presents an analysis on different utility functions by focusing on LLMs' output probability given ground-truth output, and task-specific reward given LLMs' prediction. Unlike the previous work, we introduce a novel labeling method, incremental utility, which estimates how much incremental knowledge is brought into the LLMs by a demonstration. We conduct experiments with instruction-tuned LLMs on binary/multi-class classification, segmentation, and translation across Arabic, English, Finnish, Japanese, and Spanish. Our results show that (1) the probability is effective when the probability values are distributed across the whole value range (on the classification tasks), and (2) the downstream metric is more robust when nuanced reward values are provided with long outputs (on the segmentation and translation tasks). We then show that the proposed incremental utility further helps ICL by contrasting how the LLMs perform with and without the demonstrations. View details
    Shadow Hamiltonian Simulation
    Rolando Somma
    Robbie King
    Thomas O'Brien
    arXiv:2407.21775 (2024)
    Preview abstract We present shadow Hamiltonian simulation, a framework for simulating quantum dynamics using a compressed quantum state that we call the “shadow state”. The amplitudes of this shadow state are proportional to the expectations of a set of operators of interest. The shadow state evolves according to its own Schrodinger equation, and under broad conditions can be simulated on a quantum computer. We analyze a number of applications of this framework to quantum simulation problems. This includes simulating the dynamics of exponentially large systems of free fermions, or exponentially large systems of free bosons, the latter example recovering a recent algorithm for simulating exponentially many classical harmonic oscillators. Shadow Hamiltonian simulation can be extended to simulate expectations of more complex operators such as two-time correlators or Green’s functions, and to study the evolution of operators themselves in the Heisenberg picture View details
    Preview abstract Large language models have demonstrated remarkable capabilities, but their performance is heavily reliant on effective prompt engineering. Automatic prompt optimization (APO) methods are designed to automate this and can be broadly categorized into those targeting instructions (instruction optimization, IO) vs. those targeting exemplars (exemplar selection, ES). Despite their shared objective, these have evolved rather independently, with IO recently receiving more research attention. This paper seeks to bridge this gap by comprehensively comparing the performance of representative IO and ES techniques, both isolation and combination, on a diverse set of challenging tasks. Our findings reveal that intelligently reusing model-generated input-output pairs obtained from evaluating prompts on the validation set as exemplars consistently improves performance over IO methods but is currently under-investigated. We also find that despite the recent focus on IO, how we select exemplars can outweigh how we optimize instructions, with ES strategies as simple as random search outperforming state-of-the-art IO methods with seed instructions without any optimization. Moreover, we observe synergy between ES and IO, with optimal combinations surpassing individual contributions. We conclude that studying exemplar selection as a standalone method and its optimal combination with instruction optimization remains a crucial aspect of APO and deserves greater consideration in future research, even in the era of highly capable instruction-following models. View details
    Preview abstract We study $L_2$ mean estimation under central differential privacy and communication constraints, and address two key challenges: firstly, existing mean estimation schemes that simultaneously handle both constraints are usually optimized for $L_\infty$ geometry and rely on random rotation or Kashin's representation to adapt to $L_2$ geometry, resulting in suboptimal leading constants in mean square errors (MSEs); secondly, schemes achieving order-optimal communication-privacy trade-offs do not extend seamlessly to streaming differential privacy (DP) settings (e.g., tree aggregation or matrix factorization), rendering them incompatible with DP-FTRL type optimizers. In this work, we tackle these issues by introducing a novel privacy accounting method for the sparsified Gaussian mechanism that incorporates the randomness inherent in sparsification into the DP noise. Unlike previous approaches, our accounting algorithm directly operates in $L_2$ geometry, yielding MSEs that fast converge to those of the uncompressed Gaussian mechanism. Additionally, we extend the sparsification scheme to the matrix factorization framework under streaming DP and provide a precise accountant tailored for DP-FTRL type optimizers. Empirically, our method demonstrates at least a 100x improvement of compression for DP-SGD across various FL tasks. View details