Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Sort By
1 - 15 of 133 publications
Towards Generalist Biomedical AI
Danny Driess
Andrew Carroll
Chuck Lau
Ryutaro Tanno
Ira Ktena
Anil Palepu
Basil Mustafa
Aakanksha Chowdhery
Simon Kornblith
Philip Mansfield
Sushant Prakash
Renee Wong
Sunny Virmani
Sara Mahdavi
Bradley Green
Ewa Dominowska
Joelle Barral
Karan Singhal
Pete Florence
NEJM AI (2024)
Preview abstract
BACKGROUND: Medicine is inherently multimodal, requiring the simultaneous interpretation and integration of insights between many data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence systems that flexibly encode, integrate, and interpret these data might better enable impactful applications ranging from scientific discovery to care delivery.
METHODS: To catalyze development of these models, we curated MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks, such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduced Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. To further probe the capabilities and limitations of Med-PaLM M, we conducted a radiologist evaluation of model-generated (and human) chest x-ray reports.
RESULTS: We observed encouraging performance across model scales. Med-PaLM M reached performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. In a side-by-side ranking on 246 retrospective chest x-rays, clinicians expressed a pairwise preference for Med-PaLM Multimodal reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility.
CONCLUSIONS: Although considerable work is needed to validate these models in real-world cases and understand if cross-modality generalization is possible, our results represent a milestone toward the development of generalist biomedical artificial intelligence systems.
View details
An intentional approach to managing bias in embedding models
Atilla P. Kiraly
Jungyeon Park
Rory Pilgrim
Charles Lau
Heather Cole-Lewis
Shravya Shetty
Krish Eswaran
Leo Anthony Celi
The Lancet Digital Health, 6 (2024), E126-E130
Preview abstract
Advances in machine learning for health care have brought concerns about bias from the research community; specifically, the introduction, perpetuation, or exacerbation of care disparities. Reinforcing these concerns is the finding that medical images often reveal signals about sensitive attributes in ways that are hard to pinpoint by both algorithms and people. This finding raises a question about how to best design general purpose pretrained embeddings (GPPEs, defined as embeddings meant to support a broad array of use cases) for building downstream models that are free from particular types of bias. The downstream model should be carefully evaluated for bias, and audited and improved as appropriate. However, in our view, well intentioned attempts to prevent the upstream components—GPPEs—from learning sensitive attributes can have unintended consequences on the downstream models. Despite producing a veneer of technical neutrality, the resultant end-to-end system might still be biased or poorly performing. We present reasons, by building on previously published data, to support the reasoning that GPPEs should ideally contain as much information as the original data contain, and highlight the perils of trying to remove sensitive attributes from a GPPE. We also emphasise that downstream prediction models trained for specific tasks and settings, whether developed using GPPEs or not, should be carefully designed and evaluated to avoid bias that makes models vulnerable to issues such as distributional shift. These evaluations should be done by a diverse team, including social scientists, on a diverse cohort representing the full breadth of the patient population for which the final model is intended.
View details
Demystifying Embedding Spaces using Large Language Models
Jihwan Jeong
Lior Shani
Martin Mladenov
The Twelfth International Conference on Learning Representations (2024)
Preview abstract
Embeddings have become a pivotal means to represent complex, multi-faceted information about entities, concepts, and relationships in a condensed and useful format. Nevertheless, they often preclude direct interpretation. While downstream tasks make use of these compressed representations, meaningful interpretation usually requires visualization using dimensionality reduction or specialized machine learning interpretability methods. This paper addresses the challenge of making such embeddings more interpretable and broadly useful, by employing large language models (LLMs) to directly interact with embeddings -- transforming abstract vectors into understandable narratives. By injecting embeddings into LLMs, we enable querying and exploration of complex embedding data. We demonstrate our approach on a variety of diverse tasks, including: enhancing concept activation vectors (CAVs), communicating novel embedded entities, and decoding user preferences in recommender systems. Our work couples the immense information potential of embeddings with the interpretative power of LLMs.
View details
Preview abstract
Artificial Intelligence (AI) holds the promise of transforming healthcare by improving patient outcomes, increasing accessibility and efficiency, and decreasing the cost of care. Realizing this vision of a healthier world for everyone everywhere requires partnerships and trust between healthcare systems, clinicians, payers, technology companies, pharmaceutical companies, and governments to drive innovations in machine learning and artificial intelligence to patients. Google is one example of a technology company that is partnering with healthcare systems, clinicians, and researchers to develop technology solutions that will directly improve the lives of patients. In this chapter we share landmark trials of the use of AI in healthcare. We also describe the application of our novel system of organizing information to unify data in electronic health records (EHRs) and bring an integrated view of patient records to clinicians. We discuss our consumer focused innovation in dermatology to help guide search journeys for personalized information about skin conditions. Finally, we share a perspective on how to embed ethics and a concern for all patients into the development of AI.
View details
Preview abstract
Floods are one of the most common natural disasters, with a disproportionate impact in developing countries that often lack dense streamflow gauge networks. Accurate and timely warnings are critical for mitigating flood risks, but hydrological simulation models typically must be calibrated to long data records in each watershed. Here we show that AI-based forecasting achieves reliability in predicting extreme riverine events in ungauged watersheds at up to a 5-day lead time that is similar to or better than the reliability of nowcasts (0-day lead time) from a current state of the art global modeling system (the Copernicus Emergency Management Service Global Flood Awareness System). Additionally, we achieve accuracies over 5-year return period events that are similar to or better than current accuracies over 1-year return period events. This means that AI can provide flood warnings earlier and over larger and more impactful events in ungauged basins. The model developed in this paper was incorporated into an operational early warning system that produces publicly available (free and open) forecasts in real time in over 80 countries. This work highlights a need for increasing the availability of hydrological data to continue to improve global access to reliable flood warnings.
View details
Preview abstract
Importance: Interest in artificial intelligence (AI) has reached an all-time high, and health care leaders across the ecosystem are faced with questions about where, when, and how to deploy AI and how to understand its risks, problems, and possibilities.
Observations: While AI as a concept has existed since the 1950s, all AI is not the same. Capabilities and risks of various kinds of AI differ markedly, and on examination 3 epochs of AI emerge. AI 1.0 includes symbolic AI, which attempts to encode human knowledge into computational rules, as well as probabilistic models. The era of AI 2.0 began with deep learning, in which models learn from examples labeled with ground truth. This era brought about many advances both in people’s daily lives and in health care. Deep learning models are task-specific, meaning they do one thing at a time, and they primarily focus on classification and prediction. AI 3.0 is the era of foundation models and generative AI. Models in AI 3.0 have fundamentally new (and potentially transformative) capabilities, as well as new kinds of risks, such as hallucinations. These models can do many different kinds of tasks without being retrained on a new dataset. For example, a simple text instruction will change the model’s behavior. Prompts such as “Write this note for a specialist consultant” and “Write this note for the patient’s mother” will produce markedly different content.
Conclusions and Relevance: Foundation models and generative AI represent a major revolution in AI’s capabilities, ffering tremendous potential to improve care. Health care leaders are making decisions about AI today. While any heuristic omits details and loses nuance, the framework of AI 1.0, 2.0, and 3.0 may be helpful to decision-makers because each epoch has fundamentally different capabilities and risks.
View details
Preview abstract
As instruction-tuned large language models (LLMs) gain global adoption, their ability to follow instructions in multiple languages becomes increasingly crucial. In this work, we investigate how multilinguality during instruction tuning of a multilingual LLM affects instruction-following across languages from the pre-training corpus. We first show that many languages transfer some instruction-following capabilities to other languages from even monolingual tuning. Furthermore, we find that only 40 multilingual examples integrated in an English tuning set substantially improve multilingual instruction-following, both in seen and unseen languages during tuning. In general, we observe that models tuned on multilingual mixtures exhibit comparable or superior performance in multiple languages compared to monolingually tuned models, despite training on 10x fewer examples in those languages. Finally, we find that diversifying the instruction tuning set with even just 2-4 languages significantly improves cross-lingual generalization. Our results suggest that building massively multilingual instruction-tuned models can be done with only a very small set of multilingual instruction-responses.
View details
Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction
Babak Behsaz
Zachary Ryan Mccaw
Davin Hill
Robert Luben
Dongbing Lai
John Bates
Howard Yang
Tae-Hwi Schwantes-An
Yuchen Zhou
Anthony Khawaja
Andrew Carroll
Brian Hobbs
Michael Cho
Nature Genetics (2024)
Preview abstract
Although high-dimensional clinical data (HDCD) are increasingly available in biobank-scale datasets, their use for genetic discovery remains challenging. Here we introduce an unsupervised deep learning model, Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE), for discovering associations between genetic variants and HDCD. REGLE leverages variational autoencoders to compute nonlinear disentangled embeddings of HDCD, which become the inputs to genome-wide association studies (GWAS). REGLE can uncover features not captured by existing expert-defined features and enables the creation of accurate disease-specific polygenic risk scores (PRSs) in datasets with very few labeled data. We apply REGLE to perform GWAS on respiratory and circulatory HDCD—spirograms measuring lung function and photoplethysmograms measuring blood volume changes. REGLE replicates known loci while identifying others not previously detected. REGLE are predictive of overall survival, and PRSs constructed from REGLE loci improve disease prediction across multiple biobanks. Overall, REGLE contain clinically relevant information beyond that captured by existing expert-defined features, leading to improved genetic discovery and disease prediction.
View details
Preview abstract
We study the space complexity of the two related fields of differential privacy and adaptive data analysis. Specifically,
(1) Under standard cryptographic assumptions, we show that there exists a problem P that requires exponentially more space to be solved efficiently with differential privacy, compared to the space needed without privacy. To the best of our knowledge, this is the first separation between the space complexity of private and non-private algorithms.
(2) The line of work on adaptive data analysis focuses on understanding the number of samples needed for answering a sequence of adaptive queries. We revisit previous lower bounds at a foundational level, and show that they are a consequence of a space bottleneck rather than a sampling bottleneck.
To obtain our results, we define and construct an encryption scheme with multiple keys that is built to withstand a limited amount of key leakage in a very particular way.
View details
Preview abstract
We introduce the concurrent shuffle model of differential privacy. In this model we have multiple concurrent shufflers permuting messages from different, possibly overlapping, batches of users. Similarly to the standard (single) shuffle model, the privacy requirement is that the concatenation of all shuffled messages should be differentially private. We study the private continual summation problem (a.k.a. the counter problem) and show that
the concurrent shuffle model allows for significantly improved error compared to a standard (single) shuffle model. Specifically, we give a summation algorithm with error $\Tilde{O}(n^{1/(2k+1)})$ with $k$ concurrent shufflers on a sequence of length $n$. Furthermore, we prove that this bound is tight for any $k$, even if the algorithm can choose the sizes of the batches adaptively. For $k=\log n$ shufflers, the resulting error is polylogarithmic, much better than $\Tilde{\Theta}(n^{1/3})$ which we show is the smallest possible with a single shuffler.
We use our online summation algorithm to get algorithms with improved regret bounds for the contextual linear bandit problem. In particular we get optimal $\Tilde{O}(\sqrt{n})$ regret with $k= \Tilde{\Omega}(\log n)$ concurrent shufflers.
View details
Preview abstract
We develop a deep learning based convolutional-regression model that estimates the volumetric soil moisture content in the top ~5 cm. Input predictors include Sentinel-1 (active radar), Sentinel-2 (optical imagery), and SMAP (passive radar) as well as geophysical variables from SoilGrids and modelled soil moisture fields from GLDAS. The model was trained and evaluated on data from ~1300 in-situ sensors globally over the period 2015 - 2021 and obtained an average per-sensor correlation of 0.72 and ubRMSE of 0.0546. These results are benchmarked against 13 other soil moisture estimates at different locations, and an ablation study was used to identify important predictors.
View details
Cost-utility analysis of deep learning and trained human graders for diabetic retinopathy screening in a nationwide program
Attasit Srisubat
Kankamon Kittrongsiri
Sermsiri Sangroongruangsri
Chalida Khemvaranan
Jacqueline Shreibati
John Hernandez
Fred Hersch
Prut Hanutsaha
Varis Ruamviboonsuk
Saowalak Turongkaravee
Rajiv Raman
Dr. Paisan Raumviboonsuk
Ophthalmology (2023)
Preview abstract
Introduction
Deep learning (DL) for screening diabetic retinopathy (DR) has the potential to address limited healthcare resources by enabling expanded access to healthcare. However, there is still limited health economic evaluation, particularly in low- and middle-income countries, on this subject to aid decision-making for DL adoption.
Methods
In the context of a middle-income country (MIC), using Thailand as a model, we constructed a decision tree-Markov hybrid model to estimate lifetime costs and outcomes of Thailand’s national DR screening program via DL and trained human graders (HG). We calculated the incremental cost-effectiveness ratio (ICER) between the two strategies. Sensitivity analyses were performed to probe the influence of modeling parameters.
Results
From a societal perspective, screening with DL was associated with a reduction in costs of ~ US$ 2.70, similar quality-adjusted life-years (QALY) of + 0.0043, and an incremental net monetary benefit of ~ US$ 24.10 in the base case. In sensitivity analysis, DL remained cost-effective even with a price increase from US$ 1.00 to US$ 4.00 per patient at a Thai willingness-to-pay threshold of ~ US$ 4.997 per QALY gained. When further incorporating recent findings suggesting improved compliance to treatment referral with DL, our analysis models effectiveness benefits of ~ US$ 20 to US$ 50 depending on compliance.
Conclusion
DR screening using DL in an MIC using Thailand as a model may result in societal cost-savings and similar health outcomes compared with HG. This study may provide an economic rationale to expand DL-based DR screening in MICs as an alternative solution for limited availability of skilled human resources for primary screening, particularly in MICs with similar prevalence of diabetes and low compliance to referrals for treatment.
View details
Pushing the Accuracy-Group Robustness Tradeoff Frontier with Introspective Self-play
Dj Dvijotham
Jihyeon Lee
Martin Strobel
Quan Yuan
ICLR'23 (2023) (to appear)
Preview abstract
Improving the accuracy-fairness frontier of deep neural network (DNN) models is an important problem. Uncertainty-based active learning active learning (AL)can potentially improve the frontier by preferentially sampling underrepresented subgroups to create a more balanced training dataset. However, the quality of uncertainty estimates from modern DNNs tend to degrade in the presence of spurious correlations and dataset bias, compromising the effectiveness of AL for sampling tail groups. In this work, we propose Introspective Self-play (ISP), a simple approach to improve the uncertainty estimation of a deep neural network under dataset bias, by adding an auxiliary introspection task requiring a model to predict the bias for each data point in addition to the label. We show that ISP provably improves the bias-awareness of the model representation and the resulting uncertainty estimates. On two real-world tabular and language tasks, ISP serves as a simple “plug-in” for AL model training, consistently improving both the tail-group sampling rate and the final accuracy-fairness trade-off frontier of popular AL methods.
View details
Reinforcement Learning with History Dependent Dynamic Contexts
Nadav Merlis
Martin Mladenov
Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, Hawaii
Preview abstract
We introduce a framework for modeling and solving reinforcement learning problems in non-Markovian, history-dependent environments. Our framework, called the Dynamic Contextual Markov Decision Process (DCMDP), generalizes the contextual MDP framework to handle non-Markov environments where contexts change over time. To overcome the exponential dependence on history, we leverage an aggregated mapping of previous visits to states, actions and contexts to construct an optimistic upper confidence-based algorithm, for which we establish regret bounds. Motivated by our theoretical results, we introduce a practical model-based algorithm that addresses history-dependent contexts, by planing in a latent space and using optimism over history-dependent features. We demonstrate the efficiency and performance of our approach on a recommendation task using the MovieLens dataset, in which the user's behavior is influenced by the agent's recommendations and changes over time.
View details
Preview abstract
The problem of learning threshold functions is a fundamental one in machine learning. Classical learning theory implies sample complexity of $O(\xi^{-1} \log(1/\beta))$ (for generalization error $\xi$ with confidence $1-\beta$). The private version of the problem, however, is more challenging and in particular, the sample complexity must depend on the size $|X|$ of the domain. Progress on quantifying this dependence, via lower and upper bounds, was made in a line of works over the past decade. In this paper, we finally close the gap for approximate-DP and provide a nearly tight upper bound of $\widetilde{O}(\log^* |X|)$, which matches a lower bound by Alon et al (that applies even with improper learning) and improves over a prior upper bound of $\widetilde{O}((\log^* |X|)^{1.5})$ by Kaplan et al. We also provide matching upper and lower bounds of $\tilde{\Theta}(2^{\log^*|X|})$ for the additive error of private quasi-concave optimization (a related and more general problem). Our improvement is achieved via the novel Reorder-Slice-Compute paradigm for private data analysis which we believe will have further applications.
View details