Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Assistive AI in Lung Cancer Screening: A Retrospective Multinational Study in the United States and Japan
Atilla Kiraly
Corbin Cunningham
Ryan Najafi
Jie Yang
Chuck Lau
Diego Ardila
Scott Mayer McKinney
Rory Pilgrim
Mozziyar Etemadi
Sunny Jansen
Lily Peng
Shravya Shetty
Neeral Beladia
Krish Eswaran
Radiology: Artificial Intelligence (2024)
Abstract:
Lung cancer is the leading cause of cancer death worldwide, with 1.8 million deaths in 2020. Studies have concluded that low-dose computed tomography lung cancer screening can reduce mortality by up to 61%, and updated 2021 US guidelines expanded eligibility. As screening efforts rise, AI can play an important role, but it must be unobtrusively integrated into existing clinical workflows. In this work, we introduce a state-of-the-art, cloud-based AI system providing lung cancer risk assessments without requiring any user input. We demonstrate its efficacy in assisting lung cancer screening under both US and Japanese screening settings using different patient populations and screening protocols. Technical improvements over a previously described system include a focus on earlier cancer detection for improved accuracy, introduction of an effective assistive user interface, and a system designed to integrate into typical clinical workflows. The stand-alone AI system was evaluated on 3085 individuals, achieving area under the curve (AUC) scores of 91.7% (95% CI [89.6, 95.2]), 93.3% (95% CI [90.2, 95.7]), and 89.1% (95% CI [77.7, 97.3]) on three datasets (two from the US and one from Japan), respectively. To evaluate the system's assistive ability, we conducted two retrospective multi-reader multi-case studies on 627 cases read by experienced board-certified radiologists (average 20 years of experience [7, 40]) using local PACS systems in the respective US and Japanese screening settings. The studies measured the readers' level of suspicion (LoS) and categorical responses for scores and management recommendations under country-specific screening protocols. The radiologists' AUC for LoS increased with AI assistance by 2.3% (95% CI [0.1-4.5], p=0.022) for the US study and by 2.3% (95% CI [-3.5-8.1], p=0.179) for the Japan study. Specificity for recalls increased by 5.5% (95% CI [2.7-8.5], p<0.0001) for the US study and 6.7% (95% CI [4.7-8.7], p<0.0001) for the Japan study. No significant reduction in other metrics occurred. This work advances the state of the art in lung cancer detection, introduces generalizable interface concepts applicable to similar AI applications, and demonstrates the potential impact of diagnostic AI in global lung cancer screening, with results suggesting a substantial drop in unnecessary follow-up procedures without impacting sensitivity.
Securing the AI Software Supply Chain
Isaac Hepworth
Kara Olive
Kingshuk Dasgupta
Michael Le
Mark Lodato
Mihai Maruseac
Sarah Meiklejohn
Shamik Chaudhuri
Tehila Minkus
Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043 (2024)
Abstract:
As AI-powered features gain traction in software applications, we see many of the same problems we’ve faced with traditional software—but at an accelerated pace. The threat landscape continues to expand as AI is further integrated into everyday products, so we can expect more attacks. Given the expense of building models, there is a clear need for supply chain solutions.
This paper explains our approach to securing our AI supply chain using provenance information and provides guidance for other organizations. Although there are differences between traditional and AI development processes and risks, we can build on our work over the past decade using Binary Authorization for Borg (BAB), Supply-chain Levels for Software Artifacts (SLSA), and next-generation cryptographic signing solutions via Sigstore, and adapt these to the AI supply chain without reinventing the wheel. Depending on internal processes and platforms, each organization’s approach to AI supply chain security will look different, but the focus should be on areas where it can be improved in a relatively short time.
Readers should note that the first part of this paper provides a broad overview of “Development lifecycles for traditional and AI software”. Then we delve specifically into AI supply chain risks, and explain our approach to securing our AI supply chain using provenance information. More advanced practitioners may prefer to go directly to the sections on “AI supply chain risks,” “Controls for AI supply chain security,” or even the “Guidance for practitioners” section at the end of the paper, which can be adapted to the needs of any organization.
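To make the provenance idea concrete, here is a minimal sketch (not the controls described in the paper) of the kind of check a BAB/SLSA-style admission policy might perform before a model artifact is deployed: recompute the artifact's digest and compare it against a provenance record emitted by a trusted builder. The JSON field names and the "trusted-training-pipeline" builder identifier below are hypothetical.

```python
import hashlib
import json


def sha256_of_file(path: str) -> str:
    """Hash a model artifact so it can be compared against its recorded provenance."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_deployable(artifact_path: str, provenance_path: str) -> bool:
    """Admit an artifact only if its digest matches a provenance record
    produced by a trusted builder (hypothetical record schema)."""
    with open(provenance_path) as f:
        record = json.load(f)
    return (
        record.get("builder") == "trusted-training-pipeline"  # assumed policy
        and record.get("sha256") == sha256_of_file(artifact_path)
    )
```

In a real deployment the provenance record would itself be cryptographically signed (e.g., via Sigstore) rather than trusted as a plain file; the sketch only shows where a digest comparison fits into the admission decision.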
Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting
Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) (2024)
Abstract:
Ranking documents using Large Language Models (LLMs) by directly feeding the query and candidate documents into the prompt is an interesting and practical problem. However, researchers have found it difficult to outperform fine-tuned baseline rankers on benchmark datasets. We analyze pointwise and listwise ranking prompts used by existing methods and argue that off-the-shelf LLMs do not fully understand these challenging ranking formulations. In this paper, we propose to significantly reduce the burden on LLMs by using a new technique called Pairwise Ranking Prompting (PRP). Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized, open-source LLMs. On TREC-DL 2019 and 2020, PRP based on the Flan-UL2 model with 20B parameters performs favorably against the previous best approach in the literature, which is based on the black-box commercial GPT-4 with an estimated 50x larger model size, while outperforming other LLM-based solutions, such as the 175B-parameter InstructGPT, by over 10% on all ranking metrics. Using the same prompt template on seven BEIR tasks, PRP outperforms supervised baselines, beats the black-box commercial ChatGPT solution by 4.2%, and beats pointwise LLM-based solutions by more than 10% in average NDCG@10. Furthermore, we propose several variants of PRP to improve efficiency and show that it is possible to achieve competitive results even with linear complexity.
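As an illustration of the pairwise idea (a minimal sketch, not the paper's exact prompt or aggregation), the snippet below asks an LLM to compare candidate passages two at a time, in both orders to reduce position bias, and ranks by win count. The prompt wording and the `llm` callable are placeholders.

```python
from itertools import combinations

PROMPT = (
    "Given the query: {query}\n"
    "Passage A: {a}\n"
    "Passage B: {b}\n"
    "Which passage is more relevant to the query? Answer with 'A' or 'B'."
)


def prp_allpair_rank(query, docs, llm):
    """Rank docs by pairwise wins. `llm` is any callable mapping a prompt
    string to a text completion (a placeholder for a real model call)."""
    wins = [0.0] * len(docs)
    for i, j in combinations(range(len(docs)), 2):
        # Ask in both orders to reduce position bias.
        for x, y in ((i, j), (j, i)):
            answer = llm(PROMPT.format(query=query, a=docs[x], b=docs[y])).strip().upper()
            if answer.startswith("A"):
                wins[x] += 1
            elif answer.startswith("B"):
                wins[y] += 1
            else:  # unparseable output: split the credit
                wins[x] += 0.5
                wins[y] += 0.5
    return sorted(range(len(docs)), key=wins.__getitem__, reverse=True)
```

This all-pair variant needs a quadratic number of comparisons; the efficiency variants mentioned in the abstract reduce the number of LLM calls down to linear complexity.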
Efficient Location Sampling Algorithms for Road Networks
Vivek Kumar
Ameya Velingker
Santhoshini Velusamy
WebConf (2024)
Abstract:
Many geographic information systems applications rely on the data provided by user devices in the road network. Such applications include traffic monitoring, driving navigation, detecting road closures or the construction of new roads, etc. This signal is collected by sampling locations from user trajectories and is a critical process for all such systems, yet it has not been sufficiently studied in the literature. The most natural way to sample a trajectory is perhaps with a frequency-based algorithm, e.g., sample every $x$ seconds. However, as we argue in this paper, such a simple strategy can be very wasteful in terms of resources (e.g., server-side processing, user battery) and in terms of the amount of user data that it maintains. In this work we conduct a horizontal study of various location sampling algorithms (including frequency-based, road geography-based, reservoir-sampling-based, etc.) and extract their trade-offs in terms of various metrics of interest, such as the size of the stored data and the induced quality of training for prediction tasks (e.g., predicting speeds), using the road network of New York City.
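One of the families compared is reservoir sampling; as a toy reference point (not the tuned variants studied in the paper), classic reservoir sampling keeps a fixed-size uniform sample of a location stream without storing the whole trajectory:

```python
import random


def reservoir_sample(points, k, rng=random):
    """Keep a uniform random sample of k location points from a trajectory
    stream using classic reservoir sampling (Algorithm R)."""
    reservoir = []
    for n, p in enumerate(points):
        if n < k:
            reservoir.append(p)
        else:
            j = rng.randint(0, n)  # inclusive upper bound
            if j < k:
                reservoir[j] = p
    return reservoir
```

The appeal for this setting is that the device's memory and upload size stay bounded at k points regardless of how long the trajectory is.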
Abstract:
Misgendering refers to the act of incorrectly identifying or addressing someone's gender. While misgendering is both a factual inaccuracy and a toxic act of identity erasure, research on fact-checking and toxicity detection does not address it. We are the first to bridge this gap by introducing a dataset to assist in developing interventions for misgendering. The misgendering interventions task can be divided into two sub-tasks: (i) detecting misgendering, followed by (ii) editing misgendering where it is present, in domains where editing is appropriate. We introduce a dataset containing a total of 3806 instances of tweets, YouTube comments, and LLM-generated text about 30 non-cisgender individuals, annotated for whether they contain misgendering or not. LLM-generated text is also annotated for the edits required to fix misgendering. Using this dataset, we set initial benchmarks by evaluating existing NLP systems and highlight challenges for future models to address. Additionally, we conducted a survey of non-cisgender individuals in the US to understand opinions about automated interventions for text-based misgendering. We find interest in such interventions alongside concerns about their potential harms.
Optimizing quantum gates towards the scale of logical qubits
Alexandre Bourassa
Andrew Dunsworth
Will Livingston
Vlad Sivak
Trond Andersen
Yaxing Zhang
Desmond Chik
Jimmy Chen
Charles Neill
Alejo Grajales Dau
Anthony Megrant
Alexander Korotkov
Vadim Smelyanskiy
Yu Chen
Nature Communications, 15 (2024), 2442
Abstract:
A foundational assumption of quantum error correction theory is that quantum gates can be scaled to large processors without exceeding the error-threshold for fault tolerance. Two major challenges that could become fundamental roadblocks are manufacturing high-performance quantum hardware and engineering a control system that can reach its performance limits. The control challenge of scaling quantum gates from small to large processors without degrading performance often maps to non-convex, high-constraint, and time-dynamic control optimization over an exponentially expanding configuration space. Here we report on a control optimization strategy that can scalably overcome the complexity of such problems. We demonstrate it by choreographing the frequency trajectories of 68 frequency-tunable superconducting qubits to execute single- and two-qubit gates while mitigating computational errors. When combined with a comprehensive model of physical errors across our processor, the strategy suppresses physical error rates by ~3.7× compared with the case of no optimization. Furthermore, it is projected to achieve a similar performance advantage on a distance-23 surface code logical qubit with 1057 physical qubits. Our control optimization strategy solves a generic scaling challenge in a way that can be adapted to a variety of quantum operations, algorithms, and computing architectures.
Complex Dynamics in Autobidding Systems
Georgios Piliouras
Kelly Spendlove
Proceedings of the 25th ACM Conference on Economics and Computation (2024)
Abstract:
It has become the default in markets such as ad auctions for participants to bid in an auction through automated bidding agents (autobidders) which adjust bids over time to satisfy return-over-spend constraints. Despite the prominence of such systems for the internet economy, their resulting dynamical behavior is still not well understood. Although one might hope that such relatively simple systems would typically converge to the equilibria of their underlying auctions, we provide a plethora of results that show the emergence of complex behavior, such as bi-stability, periodic orbits, and quasi-periodicity. We empirically observe how the market structure (expressed as motifs) qualitatively affects the behavior of the dynamics. We complement this with theoretical results showing that autobidding systems can simulate both linear dynamical systems and logical Boolean gates.
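To see why such dynamics are interesting, consider a toy simulation (my own illustration, not the paper's model) of two autobidders in a repeated second-price auction, each multiplicatively scaling its bid depending on whether its realized return-on-spend beats its target; even this simple rule need not settle into an equilibrium.

```python
def simulate_autobidders(steps=50, values=(1.0, 0.9), targets=(1.2, 1.2), lr=0.2):
    """Toy repeated second-price auction with two autobidders that scale their
    bids up when they beat their return-on-spend target and down otherwise."""
    mult = [1.0, 1.0]
    history = []
    for _ in range(steps):
        bids = [m * v for m, v in zip(mult, values)]
        winner = 0 if bids[0] >= bids[1] else 1
        price = bids[1 - winner]  # second-price payment
        for i in range(2):
            value = values[i] if i == winner else 0.0
            spend = price if i == winner else 0.0
            ros = value / spend if spend > 0 else float("inf")
            # Multiplicative update toward the return-on-spend target.
            mult[i] *= (1 + lr) if ros > targets[i] else (1 - lr)
        history.append(tuple(mult))
    return history
```

Running this with the default parameters shows the two multipliers chasing each other rather than converging, a crude analogue of the periodic behavior the paper studies rigorously.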
Learning from straggler clients in federated learning
Ehsan Amid
Rohan Anil
arXiv (2024) (to appear)
Abstract:
How well do existing federated learning algorithms learn from client devices that return model updates with a significant time delay? Is it even possible to learn effectively from clients that report back minutes, hours, or days after being scheduled? We answer these questions by developing Monte Carlo simulations of client latency that are guided by real-world applications. We compare well-known synchronous optimization algorithms like FedAvg and FedAdam with the state-of-the-art asynchronous FedBuff algorithm, and discover that these existing approaches often struggle to learn from severely delayed clients. To improve upon these, we experiment with modifications including distillation regularization and exponential moving averages of model weights. Finally, we invent two new algorithms, FARe-DUST and FeAST-on-MSG, based on distillation and averaging, respectively. Experiments with the EMNIST, CIFAR-100, and StackOverflow benchmark federated learning tasks demonstrate that our new algorithms outperform existing ones in terms of accuracy for straggler clients, while also providing better trade-offs between training time and total accuracy.
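One of the mitigations mentioned is an exponential moving average of model weights; a minimal sketch of that general idea (not the FARe-DUST or FeAST-on-MSG algorithms, whose details are not given in the abstract) is to fold a late client update into the server model with a weight that shrinks as the update grows staler. The staleness discount below is my own illustrative choice.

```python
def ema_update(global_weights, client_weights, decay=0.95, staleness=0, staleness_discount=0.9):
    """Blend a (possibly very late) client update into the server model with an
    exponential moving average, shrinking the contribution of stale updates.

    Weights are plain lists of floats here; a real implementation would operate
    on model tensors.
    """
    client_frac = (1.0 - decay) * (staleness_discount ** staleness)
    return [
        (1.0 - client_frac) * g + client_frac * c
        for g, c in zip(global_weights, client_weights)
    ]
```

The point of the averaging is that a straggler's contribution still moves the model, but cannot yank it far from the current iterate the way a raw synchronous aggregation of a stale update would.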
Abstract:
Service providers of large language model (LLM) applications collect user instructions in the wild and use them to further align LLMs with users' intentions. These instructions, which potentially contain sensitive information, are annotated by human workers in the process. This poses a new privacy risk not addressed by typical private optimization. To this end, we propose using synthetic instructions to replace real instructions in data annotation and model fine-tuning. Formal differential privacy is guaranteed by generating those synthetic instructions using privately fine-tuned generators. Crucial to achieving the desired utility is our novel filtering algorithm, which matches the distribution of the synthetic instructions to that of the real ones. In both supervised fine-tuning and reinforcement learning from human feedback, our extensive experiments demonstrate the high utility of the final set of synthetic instructions by showing results comparable to real instructions. In supervised fine-tuning, models trained with private synthetic instructions outperform leading open-source models such as Vicuna.
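As a loose illustration of the distribution-matching step (a toy stand-in, not the paper's filtering algorithm), one can bucket real and synthetic instructions by some feature, such as a topic cluster or a length bucket, and resample the synthetic pool so its marginal over that feature matches the real data. Note that a real deployment would also have to account for the privacy cost of computing the target histogram, which this sketch ignores.

```python
import random
from collections import Counter, defaultdict


def match_distribution(real_feats, synth_feats, synth_items, n_out, rng=random):
    """Resample synthetic instructions so the marginal distribution of a chosen
    feature (e.g., a cluster ID or length bucket) matches that of the real data."""
    target = Counter(real_feats)
    total = sum(target.values())
    pools = defaultdict(list)
    for feat, item in zip(synth_feats, synth_items):
        pools[feat].append(item)
    out = []
    for feat, count in target.items():
        quota = round(n_out * count / total)
        pool = pools.get(feat)
        if pool and quota:
            out.extend(rng.choices(pool, k=quota))  # sample with replacement
    rng.shuffle(out)
    return out[:n_out]
```

The feature extraction (clustering, length bucketing, or otherwise) is left abstract here; the only point being illustrated is the resampling-to-a-target-histogram idea.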
Connecting Language Technologies with Rich, Diverse Data Sources Covering Thousands of Languages
Sebastian Ruder
Julia Kreutzer
Clara Rivera
Ishank Saxena
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Abstract:
Contrary to common belief, there are rich and diverse data sources available for many thousands of languages, which can be used to develop technologies for these languages. In this paper, we provide an overview of some of the major online data sources, the types of data that they provide access to, potential applications of this data, and the number of languages that they cover. Even this covers only a small fraction of the data that exists; for example, printed books are published in many languages but few online aggregators exist.
Sketching for Distributed Deep Learning: A Sharper Analysis
Mayank Shrivastava
Qiaobo Li
Sanmi Koyejo
Arindam Banerjee
Conference on Neural Information Processing Systems (NeurIPS) (2024)
Abstract:
The high communication cost between the server and the clients is a significant bottleneck in scaling distributed learning for overparametrized deep models. One popular approach for reducing this communication overhead is randomized sketching. However, existing theoretical analyses for sketching-based distributed learning (sketch-DL) either incur a prohibitive dependence on the ambient dimension or need additional restrictive assumptions such as heavy-hitters. Nevertheless, despite existing pessimistic analyses, empirical evidence suggests that sketch-DL is competitive with its uncompressed counterpart -- thus motivating a sharper analysis. In this work, we introduce a sharper ambient dimension-independent convergence analysis for sketch-DL using the second-order geometry specified by the loss Hessian. Our results imply ambient dimension-independent communication complexity for sketch-DL. We present empirical results both on the loss Hessian and overall accuracy of sketch-DL supporting our theoretical results. Taken together, our results provide theoretical justification for the observed empirical success of sketch-DL.
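For readers unfamiliar with the compression scheme being analyzed, here is a minimal count-sketch encode/decode pair (a generic illustration; the exact sketch and how it enters the analyzed algorithms may differ): the client hashes each gradient coordinate into a few signed buckets, and the server recovers coordinates by taking a median of the signed bucket values.

```python
import numpy as np


def count_sketch(vec, width, depth, seed=0):
    """Compress a gradient vector into a depth x width sketch using random
    bucket assignments and random signs (shared via the seed)."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, width, size=(depth, vec.size))
    signs = rng.choice([-1.0, 1.0], size=(depth, vec.size))
    sketch = np.zeros((depth, width))
    for r in range(depth):
        np.add.at(sketch[r], buckets[r], signs[r] * vec)
    return sketch, buckets, signs


def unsketch(sketch, buckets, signs):
    """Median-of-estimates recovery of each coordinate from the sketch."""
    estimates = signs * sketch[np.arange(sketch.shape[0])[:, None], buckets]
    return np.median(estimates, axis=0)
```

Usage: `approx = unsketch(*count_sketch(grad, width=256, depth=5))` returns a per-coordinate estimate whose error shrinks as the sketch width grows, while the communicated payload is only depth x width numbers instead of the full gradient dimension.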
Productive Coverage: Improving the Actionability of Code Coverage
Gordon
Luka Kalinovcic
Mateusz Lewko
Rene Just
Yana Kulizhskaya
ICSE-SEIP '24: Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice (2024) (to appear)
Abstract:
Code coverage is an intuitive and established test adequacy measure. However, not all parts of the code base are equally important, and hence additional testing may be critical for some uncovered code, whereas it may not be worthwhile for other uncovered code. As a result, simply visualizing uncovered code is not reliably actionable. To make code coverage actionable and further improve code coverage in our codebase, we developed Productive Coverage, a novel approach to code coverage that guides developers to uncovered code that should be tested by (unit) tests. Specifically, Productive Coverage identifies uncovered code that is similar to existing code, which in turn is tested and/or frequently executed in production. We implemented and evaluated Productive Coverage for four programming languages (C++, Java, Go, and Python). The evaluation shows: (1) the developer sentiment, measured at the point of use, is strongly positive; (2) Productive Coverage meaningfully increases code coverage above a strong baseline; (3) Productive Coverage has no negative effect on code authoring efficiency; (4) Productive Coverage modestly improves code-review efficiency; (5) Productive Coverage directly improves code quality and prevents bugs from being introduced, in addition to improving test quality.
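To make the matching idea concrete, the toy below ranks uncovered functions by a crude Jaccard token overlap with already-tested code; this is only an illustrative stand-in for whatever similarity and production-execution signals the real system combines.

```python
def rank_uncovered(uncovered, tested):
    """Rank uncovered code snippets by how similar they look to already-tested
    code, using Jaccard token overlap as a crude stand-in for the real signal."""
    def tokens(src):
        return set(src.split())

    tested_tokens = [tokens(src) for src in tested]

    def best_similarity(src):
        toks = tokens(src)
        scores = [
            len(toks & t) / len(toks | t)
            for t in tested_tokens
            if toks | t  # skip empty-vs-empty comparisons
        ]
        return max(scores, default=0.0)

    # Snippets that resemble code teams already test come first.
    return sorted(uncovered, key=best_similarity, reverse=True)
```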
SEMQA: Semi-Extractive Multi-Source Question Answering
Haitian Sun
NAACL (2024) (to appear)
Abstract:
Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge.
In this paper, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer while mixing factual quoted spans, copied verbatim from given input sources, with non-factual free-text connectors that glue these spans together into a single cohesive passage. This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and the more fluent but harder-to-attribute answers of fully abstractive systems. In particular, it enables a new mode for language models that leverages their advanced language generation capabilities while also producing fine-grained in-line attributions by design that are easy to verify, interpret, and evaluate. To study this task, we create the first dataset of this kind, with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of our work for developing and studying such consolidation capabilities.
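As a toy illustration of why the semi-extractive format is easy to verify (a sketch of the general idea, not the paper's evaluation metrics), one can pull out the spans the model marked as quotes and check that each appears verbatim in some source. The double-quote markup below is a placeholder for whatever span markup the task actually uses.

```python
import re


def quoted_spans(answer, pattern=r'"([^"]+)"'):
    """Extract the spans the model marked as verbatim quotes."""
    return re.findall(pattern, answer)


def span_attribution_rate(answer, sources):
    """Fraction of quoted spans found verbatim in at least one source."""
    spans = quoted_spans(answer)
    if not spans:
        return 0.0
    hits = sum(1 for s in spans if any(s in src for src in sources))
    return hits / len(spans)
```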
Abstract:
Middle-mile logistics describes the problem of routing shipments through a network of hubs while respecting deadlines upon arrival. We consider that the hubs are linked by predefined lines, to which we have to assign vehicles. A very challenging aspect of the problem comes from the finite capacity of the vehicles: allocating a shipment to a given vehicle might block another one from using the same vehicle.
Typical exact solution methods, based on a multicommodity-flow formulation, scale poorly with the problem size, and real-world instances quickly become intractable. Instead, we turn to reinforcement learning (RL) by rephrasing the middle-mile problem as a multi-objective Markov decision process, where the state is a graph: the lines (edges) between the hubs and the parcels (nodes). At each round, we assign one shipment to a vehicle or decide that it stays at its current hub. The key ingredients of our proposed method are the extraction of small feature graphs from the state and the combination of graph neural networks (GNN) with model-free RL.
We use the PPO (proximal policy optimization) algorithm, which maintains both an actor and a critic, while being able to cope with a varying number of actions depending on the state. We compare linear functions and GraphNet (a particular kind of GNN) to approximate the policy and value functions. GNNs can deliver up to 40% more shipments than a linear function and both approaches scale well with the number of shipments per truck.
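To make the decision process concrete, the sketch below (with a hypothetical data layout of my own, not the paper's implementation) enumerates the varying action set for one shipment: stay at its current hub, or board any vehicle that departs from that hub and still has spare capacity.

```python
def feasible_actions(shipment, vehicles, capacity, load):
    """Actions for one shipment: 'stay', or any vehicle leaving its hub with room left.

    `vehicles` maps a vehicle ID to its line (origin_hub, destination_hub);
    `capacity` and `load` map vehicle IDs to their limits and current loads.
    """
    actions = ["stay"]
    for v_id, line in vehicles.items():
        if line[0] == shipment["hub"] and load[v_id] < capacity[v_id]:
            actions.append(v_id)
    return actions


def apply_action(shipment, action, vehicles, load):
    """Advance one round: the shipment waits, or moves along the chosen vehicle's line."""
    if action != "stay":
        load[action] += 1
        shipment["hub"] = vehicles[action][1]
    return shipment, load
```

In the paper's approach, a GNN policy trained with PPO scores these candidate actions; the enumeration above only shows why the number of actions varies with the state.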
Abstract:
Floods are one of the most common natural disasters, with a disproportionate impact in developing countries that often lack dense streamflow gauge networks. Accurate and timely warnings are critical for mitigating flood risks, but hydrological simulation models typically must be calibrated to long data records in each watershed. Here we show that AI-based forecasting achieves reliability in predicting extreme riverine events in ungauged watersheds at up to a 5-day lead time that is similar to or better than the reliability of nowcasts (0-day lead time) from a current state-of-the-art global modeling system (the Copernicus Emergency Management Service Global Flood Awareness System). Additionally, we achieve accuracies over 5-year return period events that are similar to or better than current accuracies over 1-year return period events. This means that AI can provide flood warnings earlier and over larger and more impactful events in ungauged basins. The model developed in this paper was incorporated into an operational early warning system that produces publicly available (free and open) forecasts in real time in over 80 countries. This work highlights a need for increasing the availability of hydrological data to continue to improve global access to reliable flood warnings.