Alan Karthikesalingam

Alan Karthikesalingam

Alan is a clinician and Research Scientist working on Foundation Models for health, most recently including Med-PaLM, Med-PaLM-2, Med-PaLM-Multimodal and AMIE. Prior to this his work at DeepMind and Google explored applications of AI in radiology, ophthalmology, dermatology and electronic health records, resulting in papers published in Nature and Nature Medicine. He is an honorary Lecturer in Vascular Surgery at Imperial College in London. He completed his MA in Neuroscience and Medical Degree (MBBChir) at the University of Cambridge before specialist training in surgery in the London Deanery, where he completed his Membership of the Royal College of Surgeons (MRCS), PhD in Vascular Surgery and was appointed as a NIHR Clinical Lecturer. In 2017 he joined DeepMind's health research team and in 2019 joined Google Health. Prior to joining Google he had published over 150 peer-reviewed articles including first-author studies in the New England Journal of Medicine and The Lancet.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Advances in machine learning for health care have brought concerns about bias from the research community; specifically, the introduction, perpetuation, or exacerbation of care disparities. Reinforcing these concerns is the finding that medical images often reveal signals about sensitive attributes in ways that are hard to pinpoint by both algorithms and people. This finding raises a question about how to best design general purpose pretrained embeddings (GPPEs, defined as embeddings meant to support a broad array of use cases) for building downstream models that are free from particular types of bias. The downstream model should be carefully evaluated for bias, and audited and improved as appropriate. However, in our view, well intentioned attempts to prevent the upstream components—GPPEs—from learning sensitive attributes can have unintended consequences on the downstream models. Despite producing a veneer of technical neutrality, the resultant end-to-end system might still be biased or poorly performing. We present reasons, by building on previously published data, to support the reasoning that GPPEs should ideally contain as much information as the original data contain, and highlight the perils of trying to remove sensitive attributes from a GPPE. We also emphasise that downstream prediction models trained for specific tasks and settings, whether developed using GPPEs or not, should be carefully designed and evaluated to avoid bias that makes models vulnerable to issues such as distributional shift. These evaluations should be done by a diverse team, including social scientists, on a diverse cohort representing the full breadth of the patient population for which the final model is intended. View details
    Understanding metric-related pitfalls in image analysis validation
    Annika Reinke
    Lena Maier-Hein
    Paul Jager
    Shravya Shetty
    Understanding Metrics Workgroup
    Nature Methods (2024)
    Preview abstract Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation. View details
    Towards Generalist Biomedical AI
    Danny Driess
    Andrew Carroll
    Chuck Lau
    Ryutaro Tanno
    Ira Ktena
    Anil Palepu
    Basil Mustafa
    Aakanksha Chowdhery
    Simon Kornblith
    Philip Mansfield
    Sushant Prakash
    Renee Wong
    Sunny Virmani
    Sara Mahdavi
    Bradley Green
    Ewa Dominowska
    Joelle Barral
    Karan Singhal
    Pete Florence
    NEJM AI(2024)
    Preview abstract BACKGROUND: Medicine is inherently multimodal, requiring the simultaneous interpretation and integration of insights between many data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence systems that flexibly encode, integrate, and interpret these data might better enable impactful applications ranging from scientific discovery to care delivery. METHODS: To catalyze development of these models, we curated MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks, such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduced Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. To further probe the capabilities and limitations of Med-PaLM M, we conducted a radiologist evaluation of model-generated (and human) chest x-ray reports. RESULTS: We observed encouraging performance across model scales. Med-PaLM M reached performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. In a side-by-side ranking on 246 retrospective chest x-rays, clinicians expressed a pairwise preference for Med-PaLM Multimodal reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility. CONCLUSIONS: Although considerable work is needed to validate these models in real-world cases and understand if cross-modality generalization is possible, our results represent a milestone toward the development of generalist biomedical artificial intelligence systems. View details
    Generative models improve fairness of medical classifiers under distribution shifts
    Ira Ktena
    Olivia Wiles
    Isabela Albuquerque
    Sylvestre-Alvise Rebuffi
    Ryutaro Tanno
    Danielle Belgrave
    Taylan Cemgil
    Nature Medicine(2024)
    Preview abstract Domain generalization is a ubiquitous challenge for machine learning in healthcare. Model performance in real-world conditions might be lower than expected because of discrepancies between the data encountered during deployment and development. Underrepresentation of some groups or conditions during model development is a common cause of this phenomenon. This challenge is often not readily addressed by targeted data acquisition and ‘labeling’ by expert clinicians, which can be prohibitively expensive or practically impossible because of the rarity of conditions or the available clinical expertise. We hypothesize that advances in generative artificial intelligence can help mitigate this unmet need in a steerable fashion, enriching our training dataset with synthetic examples that address shortfalls of underrepresented conditions or subgroups. We show that diffusion models can automatically learn realistic augmentations from data in a label-efficient manner. We demonstrate that learned augmentations make models more robust and statistically fair in-distribution and out of distribution. To evaluate the generality of our approach, we studied three distinct medical imaging contexts of varying difficulty: (1) histopathology, (2) chest X-ray and (3) dermatology images. Complementing real samples with synthetic ones improved the robustness of models in all three medical tasks and increased fairness by improving the accuracy of clinical diagnosis within underrepresented groups, especially out of distribution. View details
    Preview abstract Background Health datasets from clinical sources do not reflect the breadth and diversity of disease in the real world, impacting research, medical education and artificial intelligence (AI) tool development. Dermatology is a suitable area to develop and test a new and scalable method to create representative health datasets. Methods We used Google Search advertisements to solicit contributions of images of dermatology conditions, demographic and symptom information from internet users in the United States (US) over 265 days starting March 2023. With informed contributor consent, we described and released this dataset containing 10,106 images from 5058 contributions, with dermatologist labels as well as Fitzpatrick Skin Type and Monk Skin Tone labels for the images. Results We received 22 ± 14 submissions/day over 265 days. Female contributors (66.04%) and younger individuals (52.3% < age 40) had a higher representation in the dataset compared to the US population, and 36.6% of contributors had a non-White racial or ethnic identity. Over 97.5% of contributions were genuine images of skin conditions. Image quality had no impact on dermatologist confidence in assigning a differential diagnosis. The dataset consists largely of short duration (54% with onset < 7 days ago) allergic, infectious, and inflammatory conditions. Fitzpatrick skin type distribution is well-balanced, considering the geographical origin of the dataset and the absence of enrichment for population groups or skin tones. Interpretation Search ads are effective at crowdsourcing images of health conditions. The SCIN dataset bridges important gaps in the availability of representative images of common skin conditions. View details
    Conversational AI in health: Design considerations from a Wizard-of-Oz dermatology case study with users, clinicians and a medical LLM
    Brenna Li
    Amy Wang
    Patricia Strachan
    Julie Anne Seguin
    Sami Lachgar
    Karyn Schroeder
    Renee Wong
    Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery, pp. 10
    Preview abstract Although skin concerns are common, access to specialist care is limited. Artificial intelligence (AI)-assisted tools to support medical decisions may provide patients with feedback on their concerns while also helping ensure the most urgent cases are routed to dermatologists. Although AI-based conversational agents have been explored recently, how they are perceived by patients and clinicians is not well understood. We conducted a Wizard-of-Oz study involving 18 participants with real skin concerns. Participants were randomly assigned to interact with either a clinician agent (portrayed by a dermatologist) or an LLM agent (supervised by a dermatologist) via synchronous multimodal chat. In both conditions, participants found the conversation to be helpful in understanding their medical situation and alleviate their concerns. Through qualitative coding of the conversation transcripts, we provide insight on the importance of empathy and effective information-seeking. We conclude with design considerations for future AI-based conversational agents in healthcare settings. View details
    Towards Conversational Diagnostic AI
    Anil Palepu
    Khaled Saab
    Jan Freyberg
    Ryutaro Tanno
    Amy Wang
    Brenna Li
    Nenad Tomašev
    Karan Singhal
    Le Hou
    Albert Webson
    Kavita Kulkarni
    Sara Mahdavi
    Juro Gottweis
    Joelle Barral
    Kat Chou
    Arxiv(2024) (to appear)
    Preview abstract At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. Artificial Intelligence (AI) systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue. AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts. We designed a framework for evaluating clinically-meaningful axes of performance including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE). The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text-chat which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI. View details
    Preview abstract Large language models (LLMs) hold immense promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. In this work, we present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and then conduct an empirical case study with Med-PaLM 2, resulting in the largest human evaluation study in this area to date. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases, and EquityMedQA, a collection of seven newly-released datasets comprising both manually-curated and LLM-generated questions enriched for adversarial queries. Both our human assessment framework and dataset design process are grounded in an iterative participatory approach and review of possible biases in Med-PaLM 2 answers to adversarial queries. Through our empirical study, we find that the use of a collection of datasets curated through a variety of methodologies, coupled with a thorough evaluation protocol that leverages multiple assessment rubric designs and diverse rater groups, surfaces biases that may be missed via narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. We emphasize that while our framework can identify specific forms of bias, it is not sufficient to holistically assess whether the deployment of an AI system promotes equitable health outcomes. We hope the broader community leverages and builds on these tools and methods towards realizing a shared goal of LLMs that promote accessible and equitable healthcare for all. View details
    Preview abstract Machine learning (ML) holds great promise for improving healthcare, but it is critical to ensure that its use will not propagate or amplify health disparities. An important step is to characterize the (un)fairness of ML models—their tendency to perform differently across subgroups of the population—and to understand its underlying mechanisms. One potential driver of algorithmic unfairness, shortcut learning, arises when ML models base predictions on improper correlations in the training data. Diagnosing this phenomenon is difficult as sensitive attributes may be causally linked with disease. Using multitask learning, we propose a method to directly test for the presence of shortcut learning in clinical ML systems and demonstrate its application to clinical tasks in radiology and dermatology. Finally, our approach reveals instances when shortcutting is not responsible for unfairness, highlighting the need for a holistic approach to fairness mitigation in medical AI. View details
    Large Language Models Encode Clinical Knowledge
    Karan Singhal
    Sara Mahdavi
    Jason Wei
    Hyung Won Chung
    Nathan Scales
    Ajay Tanwani
    Heather Cole-Lewis
    Perry Payne
    Martin Seneviratne
    Paul Gamble
    Abubakr Abdelrazig Hassan Babiker
    Nathanael Schaerli
    Aakanksha Chowdhery
    Philip Mansfield
    Dina Demner-Fushman
    Katherine Chou
    Juraj Gottweis
    Nenad Tomašev
    Alvin Rajkomar
    Joelle Barral
    Nature(2023)
    Preview abstract Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications. View details