Alan Karthikesalingam

Alan Karthikesalingam

Alan is a Principal Scientist and Director at Google DeepMind working on biomedical AI. He cofounded and leads AI co-clinician and AI co-scientist; and previously founded and led precursors in AMIE, Med-PaLM, Med-PaLM-2 and Med-PaLM-Multimodal. His research in over 10 full Nature/Nature-Medicine publication spans breakthroughs for conversational and diagnostic AI systems in multiple domains. These include the first example of an AI system passing medical license-style examinations, competently undertaking medical consultations, and solving multimodal diagnostic challenges spanning cardiology, radiology, oncology, ophthalmology, dermatology and electronic health records. He is an honorary Lecturer in Vascular Surgery at Imperial College in London. He completed his MA in Neuroscience and Medical Degree (MBBChir) at the University of Cambridge before specialist training in surgery in the London Deanery, where he completed his Membership of the Royal College of Surgeons (MRCS), PhD in Vascular Surgery and appointment as NIHR Clinical Lecturer. Prior to joining Google DeepMind Alan had published over 150 peer-reviewed articles including first-author studies in the New England Journal of Medicine and The Lancet.

Research Areas

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors
Juro Gottweis
CJ Park
Salman Rahman
Ahmed Metwally
Hong Yu
Ivor Rendulic
Yuzhe Yang
Petar Sirkovic
Daniel McDuff
Shwetak Patel
Nicolas Stroppa
Yubin Kim
Mark Malhotra
Orson Xu
Sam Schmidgall
Tim Althoff
Elahe Vedadi
Cynthia Breazeal
Hae Won Park
(2026)
Preview abstract Generative Artificial Intelligence (AI), particularly Large Language Models (LLMs), have demonstrated significant potential in clinical reasoning skills such as history-taking and differential diagnosis generation—critical aspects of medical education. This work explores how LLMs can augment medical curricula through interactive learning. We conducted a participatory design process with medical students, residents and medical education experts to co-create an AI-powered tutor prototype for clinical reasoning. As part of the co-design process, we conducted a qualitative user study, investigating learning needs and practices via interviews, and conducting concept evaluations through interactions with the prototype. Findings highlight the challenges learners face in transitioning from theoretical knowledge to practical application, and how an AI tutor can provide personalized practice and feedback. We conclude with design considerations, emphasizing the importance of context-specific knowledge and emulating positive preceptor traits, to guide the development of AI tools for medical education. View details
Preview abstract Note this is a re-submission of a previously approved ITP. The previous approval was conditional for a journal pre-sub enquiry only and we are submitting a new ITP for the preprint of the paper. AI models have been proposed for hypothesis generation, but testing their ability to drive high-impact research is challenging, since an AI-generated hypothesis can take decades to validate. Here, we challenge the ability of a recently developed LLM-based platform to generate high-level hypotheses by posing a question that took years to resolve experimentally but remained unpublished: How could capsid-forming phage-inducible chromosomal islands (cf-PICIs) spread across bacterial species? Remarkably, the AI’s top- ranked hypothesis matched our experimentally confirmed mechanism: cf-PICIs hijack diverse phage tails to expand their host range. We critically assess the AI’s five highest- ranked hypotheses, showing that some opened new research avenues in our laboratories. We benchmark its performance against other LLMs and outline best practices for integrating AI into scientific discovery. Our findings suggest that AI can act not just as a computational tool, but as a creative engine, accelerating discovery and reshaping how we generate and test scientific hypotheses. View details
Preview abstract While large language models (LLMs) have shown promise in diagnostic dialogue, their capabilities for effective management reasoning - including disease progression, therapeutic response, and safe medication prescription - remain under-explored. We advance the previously demonstrated diagnostic capabilities of the Articulate Medical Intelligence Explorer (AMIE) through a new LLM-based agentic system optimised for clinical management and dialogue, incorporating reasoning over the evolution of disease and multiple patient visit encounters, response to therapy, and professional competence in medication prescription. To ground its reasoning in authoritative clinical knowledge, AMIE leverages Gemini's long-context capabilities, combining in-context retrieval with structured reasoning to align its output with relevant and up-to-date clinical practice guidelines and drug formularies. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) study, AMIE was compared to 21 primary care physicians (PCPs) across 100 multi-visit case scenarios designed to reflect UK NICE Guidance and BMJ Best Practice guidelines. AMIE was non-inferior to PCPs in management reasoning as assessed by specialist physicians and scored better in both preciseness of treatments and investigations, and in its alignment with and grounding of management plans in clinical guidelines. To benchmark medication reasoning, we developed RxQA, a multiple-choice question benchmark derived from two national drug formularies (US, UK) and validated by board-certified pharmacists. While AMIE and PCPs both benefited from the ability to access external drug information, AMIE outperformed PCPs on higher difficulty questions. While further research would be needed before real-world translation, AMIE's strong performance across evaluations marks a significant step towards conversational AI as a tool in disease management. View details
Generative models improve fairness of medical classifiers under distribution shifts
Ira Ktena
Olivia Wiles
Isabela Albuquerque
Sylvestre-Alvise Rebuffi
Ryutaro Tanno
Danielle Belgrave
Taylan Cemgil
Nature Medicine (2024)
Preview abstract Domain generalization is a ubiquitous challenge for machine learning in healthcare. Model performance in real-world conditions might be lower than expected because of discrepancies between the data encountered during deployment and development. Underrepresentation of some groups or conditions during model development is a common cause of this phenomenon. This challenge is often not readily addressed by targeted data acquisition and ‘labeling’ by expert clinicians, which can be prohibitively expensive or practically impossible because of the rarity of conditions or the available clinical expertise. We hypothesize that advances in generative artificial intelligence can help mitigate this unmet need in a steerable fashion, enriching our training dataset with synthetic examples that address shortfalls of underrepresented conditions or subgroups. We show that diffusion models can automatically learn realistic augmentations from data in a label-efficient manner. We demonstrate that learned augmentations make models more robust and statistically fair in-distribution and out of distribution. To evaluate the generality of our approach, we studied three distinct medical imaging contexts of varying difficulty: (1) histopathology, (2) chest X-ray and (3) dermatology images. Complementing real samples with synthetic ones improved the robustness of models in all three medical tasks and increased fairness by improving the accuracy of clinical diagnosis within underrepresented groups, especially out of distribution. View details
Quantifying urban park use in the USA at scale: empirical estimates of realised park usage using smartphone location data
Michael T Young
Swapnil Vispute
Stylianos Serghiou
Akim Kumok
Yash Shah
Kevin J. Lane
Flannery Black-Ingersoll
Paige Brochu
Monica Bharel
Sarah Skenazy
Shailesh Bavadekar
Mansi Kansal
Evgeniy Gabrilovich
Gregory A. Wellenius
Lancet Planetary Health (2024)
Preview abstract Summary Background A large body of evidence connects access to greenspace with substantial benefits to physical and mental health. In urban settings where access to greenspace can be limited, park access and use have been associated with higher levels of physical activity, improved physical health, and lower levels of markers of mental distress. Despite the potential health benefits of urban parks, little is known about how park usage varies across locations (between or within cities) or over time. Methods We estimated park usage among urban residents (identified as residents of urban census tracts) in 498 US cities from 2019 to 2021 from aggregated and anonymised opted-in smartphone location history data. We used descriptive statistics to quantify differences in park usage over time, between cities, and across census tracts within cities, and used generalised linear models to estimate the associations between park usage and census tract level descriptors. Findings In spring (March 1 to May 31) 2019, 18·9% of urban residents visited a park at least once per week, with average use higher in northwest and southwest USA, and lowest in the southeast. Park usage varied substantially both within and between cities; was unequally distributed across census tract-level markers of race, ethnicity, income, and social vulnerability; and was only moderately correlated with established markers of census tract greenspace. In spring 2019, a doubling of walking time to parks was associated with a 10·1% (95% CI 5·6–14·3) lower average weekly park usage, adjusting for city and social vulnerability index. The median decline in park usage from spring 2019 to spring 2020 was 38·0% (IQR 28·4–46·5), coincident with the onset of physical distancing policies across much of the country. We estimated that the COVID-19-related decline in park usage was more pronounced for those living further from a park and those living in areas of higher social vulnerability. Interpretation These estimates provide novel insights into the patterns and correlates of park use and could enable new studies of the health benefits of urban greenspace. In addition, the availability of an empirical park usage metric that varies over time could be a useful tool for assessing the effectiveness of policies intended to increase such activities. View details
Preview abstract Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare. View details
Understanding metric-related pitfalls in image analysis validation
Annika Reinke
Lena Maier-Hein
Paul Jager
Shravya Shetty
Understanding Metrics Workgroup
Nature Methods (2024)
Preview abstract Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation. View details
Creating an Empirical Dermatology Dataset Through Crowdsourcing With Web Search Advertisements
Abbi Ward
Jimmy Li
Julie Wang
Sriram Lakshminarasimhan
Ashley Carrick
Jay Hartford
Pradeep Kumar S
Sunny Virmani
Renee Wong
Margaret Ann Smith
Dawn Siegel
Steven Lin
Justin Ko
JAMA Network Open (2024)
Preview abstract Importance: Health datasets from clinical sources do not reflect the breadth and diversity of disease, impacting research, medical education, and artificial intelligence tool development. Assessments of novel crowdsourcing methods to create health datasets are needed. Objective: To evaluate if web search advertisements (ads) are effective at creating a diverse and representative dermatology image dataset. Design, Setting, and Participants: This prospective observational survey study, conducted from March to November 2023, used Google Search ads to invite internet users in the US to contribute images of dermatology conditions with demographic and symptom information to the Skin Condition Image Network (SCIN) open access dataset. Ads were displayed against dermatology-related search queries on mobile devices, inviting contributions from adults after a digital informed consent process. Contributions were filtered for image safety and measures were taken to protect privacy. Data analysis occurred January to February 2024. Exposure: Dermatologist condition labels as well as estimated Fitzpatrick Skin Type (eFST) and estimated Monk Skin Tone (eMST) labels. Main Outcomes and Measures: The primary metrics of interest were the number, quality, demographic diversity, and distribution of clinical conditions in the crowdsourced contributions. Spearman rank order correlation was used for all correlation analyses, and the χ2 test was used to analyze differences between SCIN contributor demographics and the US census. Results: In total, 5749 submissions were received, with a median of 22 (14-30) per day. Of these, 5631 (97.9%) were genuine images of dermatological conditions. Among contributors with self-reported demographic information, female contributors (1732 of 2596 contributors [66.7%]) and younger contributors (1329 of 2556 contributors [52.0%] aged <40 years) had a higher representation in the dataset compared with the US population. Of 2614 contributors who reported race and ethnicity, 852 (32.6%) reported a racial or ethnic identity other than White. Dermatologist confidence in assigning a differential diagnosis increased with the number of self-reported demographic and skin-condition–related variables (Spearman R = 0.1537; P < .001). Of 4019 contributions reporting duration since onset, 2170 (54.0%) reported onset within less than 7 days of submission. Of the 2835 contributions that could be assigned a dermatological differential diagnosis, 2523 (89.0%) were allergic, infectious, or inflammatory conditions. eFST and eMST distributions reflected the geographical origin of the dataset. Conclusions and Relevance: The findings of this survey study suggest that search ads are effective at crowdsourcing dermatology images and could therefore be a useful method to create health datasets. The SCIN dataset bridges important gaps in the availability of images of common, short-duration skin conditions. View details
Conversational AI in health: Design considerations from a Wizard-of-Oz dermatology case study with users, clinicians and a medical LLM
Brenna Li
Amy Wang
Patricia Strachan
Julie Anne Seguin
Sami Lachgar
Karyn Schroeder
Renee Wong
Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery, pp. 10
Preview abstract Although skin concerns are common, access to specialist care is limited. Artificial intelligence (AI)-assisted tools to support medical decisions may provide patients with feedback on their concerns while also helping ensure the most urgent cases are routed to dermatologists. Although AI-based conversational agents have been explored recently, how they are perceived by patients and clinicians is not well understood. We conducted a Wizard-of-Oz study involving 18 participants with real skin concerns. Participants were randomly assigned to interact with either a clinician agent (portrayed by a dermatologist) or an LLM agent (supervised by a dermatologist) via synchronous multimodal chat. In both conditions, participants found the conversation to be helpful in understanding their medical situation and alleviate their concerns. Through qualitative coding of the conversation transcripts, we provide insight on the importance of empathy and effective information-seeking. We conclude with design considerations for future AI-based conversational agents in healthcare settings. View details
×