Mike Schaekermann

Authored Publications
    Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare.
    Health equity assessment of machine learning performance (HEAL): a framework and dermatology AI model case study
    Terry Spitz
    Malcolm Chelliah
    Heather Cole-Lewis
    Stephanie Farquhar
    Qinghan Xue
    Jenna Lester
    Cían Hughes
    Patricia Strachan
    Fraser Tan
    Peggy Bui
    Craig Mermel
    Lily Peng
    Sunny Virmani
    Ivor Horn
    Cameron Chen
    The Lancet eClinicalMedicine (2024)
    Background: Artificial intelligence (AI) has repeatedly been shown to encode historical inequities in healthcare. We aimed to develop a framework to quantitatively assess the performance equity of health AI technologies and to illustrate its utility via a case study.
    Methods: Here, we propose a methodology to assess whether health AI technologies prioritise performance for patient populations experiencing worse outcomes, which is complementary to existing fairness metrics. We developed the Health Equity Assessment of machine Learning performance (HEAL) framework, designed to quantitatively assess the performance equity of health AI technologies via a four-step interdisciplinary process to understand and quantify domain-specific criteria, and the resulting HEAL metric. As an illustrative case study (analysis conducted between October 2022 and January 2023), we applied the HEAL framework to a dermatology AI model. A set of 5420 teledermatology cases (store-and-forward cases from patients of 20 years or older, submitted from primary care providers in the USA and skin cancer clinics in Australia), enriched for diversity in age, sex and race/ethnicity, was used to retrospectively evaluate the AI model's HEAL metric, defined as the likelihood that the AI model performs better for subpopulations with worse average health outcomes as compared to others. The likelihood that AI performance was anticorrelated to pre-existing health outcomes was estimated using bootstrap methods as the probability that the negated Spearman's rank correlation coefficient (i.e., "R") was greater than zero. Positive values of R suggest that subpopulations with poorer health outcomes have better AI model performance. Thus, the HEAL metric, defined as p(R > 0), measures how likely the AI technology is to prioritise performance for subpopulations with worse average health outcomes as compared to others (presented as a percentage below). Health outcomes were quantified as disability-adjusted life years (DALYs) when grouping by sex and age, and years of life lost (YLLs) when grouping by race/ethnicity. AI performance was measured as top-3 agreement with the reference diagnosis from a panel of 3 dermatologists per case.
    Findings: Across all dermatologic conditions, the HEAL metric was 80.5% for prioritising AI performance of racial/ethnic subpopulations based on YLLs, and 92.1% and 0.0%, respectively, for prioritising AI performance of sex and age subpopulations based on DALYs. Certain dermatologic conditions were significantly associated with greater AI model performance compared to a reference category of less common conditions. For skin cancer conditions, the HEAL metric was 73.8% for prioritising AI performance of age subpopulations based on DALYs.
    Interpretation: Analysis using the proposed HEAL framework showed that the dermatology AI model prioritised performance for race/ethnicity, sex (all conditions) and age (cancer conditions) subpopulations with respect to pre-existing health disparities. More work is needed to investigate ways of promoting equitable AI performance across age for non-cancer conditions and to better understand how AI models can contribute towards improving equity in health outcomes.
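To make the metric concrete, here is a minimal sketch of how a HEAL-style score could be computed from per-subpopulation data. The function name, the bootstrap over subpopulations, and the sign convention (health burden expressed as DALYs or YLLs, so higher values mean worse pre-existing outcomes) are one reading of the abstract, not the authors' implementation.

```python
import numpy as np
from scipy.stats import spearmanr

def heal_metric(outcome_burden, ai_performance, n_boot=5_000, seed=0):
    """Sketch of a HEAL-style score: the bootstrap probability that AI performance
    is positively rank-correlated with health-outcome burden (DALYs or YLLs, where
    higher values mean worse pre-existing outcomes). Under that sign convention this
    corresponds to p(R > 0) in the abstract: values near 1 mean the model tends to
    perform better for subpopulations with worse average health outcomes."""
    burden = np.asarray(outcome_burden, dtype=float)
    perf = np.asarray(ai_performance, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(burden)
    positive, valid = 0, 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)            # resample subpopulations with replacement
        rho, _ = spearmanr(burden[idx], perf[idx])  # rank correlation: burden vs. performance
        if np.isnan(rho):                           # constant resample; skip it
            continue
        valid += 1
        positive += rho > 0                         # worse-off groups get better performance
    return positive / valid if valid else float("nan")

# Hypothetical example: six subpopulations with outcome burden and top-3 agreement rates.
print(heal_metric([12.0, 8.5, 15.2, 6.1, 10.4, 14.0],
                  [0.91, 0.86, 0.93, 0.82, 0.88, 0.92]))
```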
    Towards Generalist Biomedical AI
    Danny Driess
    Andrew Carroll
    Chuck Lau
    Ryutaro Tanno
    Ira Ktena
    Anil Palepu
    Basil Mustafa
    Aakanksha Chowdhery
    Simon Kornblith
    Philip Mansfield
    Sushant Prakash
    Renee Wong
    Sunny Virmani
    Sara Mahdavi
    Bradley Green
    Ewa Dominowska
    Joelle Barral
    Karan Singhal
    Pete Florence
    NEJM AI (2024)
    BACKGROUND: Medicine is inherently multimodal, requiring the simultaneous interpretation and integration of insights across many data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence systems that flexibly encode, integrate, and interpret these data might better enable impactful applications ranging from scientific discovery to care delivery.
    METHODS: To catalyze development of these models, we curated MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks, such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduced Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. To further probe the capabilities and limitations of Med-PaLM M, we conducted a radiologist evaluation of model-generated (and human-written) chest x-ray reports.
    RESULTS: We observed encouraging performance across model scales. Med-PaLM M reached performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. In a side-by-side ranking on 246 retrospective chest x-rays, clinicians expressed a pairwise preference for Med-PaLM M reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility.
    CONCLUSIONS: Although considerable work is needed to validate these models in real-world cases and to understand whether cross-modality generalization is possible, our results represent a milestone toward the development of generalist biomedical artificial intelligence systems.
    Conversational AI in health: Design considerations from a Wizard-of-Oz dermatology case study with users, clinicians and a medical LLM
    Brenna Li
    Amy Wang
    Patricia Strachan
    Julie Anne Seguin
    Sami Lachgar
    Karyn Schroeder
    Renee Wong
    Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery, pp. 10
    Although skin concerns are common, access to specialist care is limited. Artificial intelligence (AI)-assisted tools to support medical decisions may provide patients with feedback on their concerns while also helping ensure the most urgent cases are routed to dermatologists. Although AI-based conversational agents have been explored recently, how they are perceived by patients and clinicians is not well understood. We conducted a Wizard-of-Oz study involving 18 participants with real skin concerns. Participants were randomly assigned to interact with either a clinician agent (portrayed by a dermatologist) or an LLM agent (supervised by a dermatologist) via synchronous multimodal chat. In both conditions, participants found the conversation helpful for understanding their medical situation and for alleviating their concerns. Through qualitative coding of the conversation transcripts, we provide insight into the importance of empathy and effective information-seeking. We conclude with design considerations for future AI-based conversational agents in healthcare settings.
    Towards Conversational Diagnostic AI
    Anil Palepu
    Khaled Saab
    Jan Freyberg
    Ryutaro Tanno
    Amy Wang
    Brenna Li
    Nenad Tomašev
    Karan Singhal
    Le Hou
    Albert Webson
    Kavita Kulkarni
    Sara Mahdavi
    Juro Gottweis
    Joelle Barral
    Kat Chou
    arXiv (2024) (to appear)
    At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. Artificial Intelligence (AI) systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue. AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts. We designed a framework for evaluating clinically meaningful axes of performance, including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE). The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text chat, which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI.
    Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report generation
    Ryutaro Tanno
    David Barrett
    Sumedh Ghaisas
    Sumanth Dathathri
    Abi See
    Johannes Welbl
    Karan Singhal
    Rhys May
    Roy Lee
    SiWai Man
    Zahra Ahmed
    Sara Mahdavi
    Joelle Barral
    Ali Eslami
    Danielle Belgrave
    Shravya Shetty
    Po-Sen Huang
    Ira Ktena
    arXiv (2023)
    Radiology reports are an instrumental part of modern medicine, informing key clinical decisions such as diagnosis and treatment. The worldwide shortage of radiologists, however, restricts access to expert care and imposes heavy workloads, contributing to avoidable errors and delays in report delivery. While recent progress in automated report generation with vision-language models offers clear potential for ameliorating the situation, the path to real-world adoption has been stymied by the challenge of evaluating the clinical quality of AI-generated reports. In this study, we build a state-of-the-art report generation system for chest radiographs, Flamingo-CXR, by fine-tuning a well-known vision-language foundation model on radiology data. To evaluate the quality of the AI-generated reports, a group of 16 certified radiologists provided detailed evaluations of AI-generated and human-written reports for chest X-rays from an intensive care setting in the United States and an inpatient setting in India. At least one radiologist (out of two per case) preferred the AI report to the ground truth report in over 60% of cases for both datasets. Among the subset of AI-generated reports that contained errors, the most frequently cited reasons related to location and finding, whereas for human-written reports, most mistakes related to severity and finding. This disparity suggested potential complementarity between our AI system and human experts, prompting us to develop an assistive scenario in which Flamingo-CXR generates a first-draft report that is subsequently revised by a clinician. This is the first demonstration of clinician-AI collaboration for report writing, and the resulting reports were assessed by at least one radiologist to be equivalent to or preferred over reports written by experts alone in 80% of inpatient cases and 60% of intensive care cases.
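As a small, hedged sketch of the per-case preference aggregation described above (a case counts when at least one of its two radiologist raters preferred the AI-generated report), the snippet below shows one way to compute that rate; the input format is an illustrative assumption.

```python
def at_least_one_prefers_ai(case_preferences):
    """case_preferences: list of per-case pairs of ratings, each "ai" or "human",
    one rating per radiologist. Returns the fraction of cases in which at least
    one of the two radiologists preferred the AI-generated report."""
    hits = sum(any(rating == "ai" for rating in pair) for pair in case_preferences)
    return hits / len(case_preferences)

# Hypothetical ratings for four cases, two radiologists per case.
print(at_least_one_prefers_ai([("ai", "human"), ("human", "human"),
                               ("ai", "ai"), ("human", "ai")]))  # 0.75
```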
    Towards Accurate Differential Diagnosis with Large Language Models
    Daniel McDuff
    Anil Palepu
    Amy Wang
    Karan Singhal
    Yash Sharma
    Kavita Kulkarni
    Le Hou
    Sara Mahdavi
    Sushant Prakash
    Anupam Pathak
    Shwetak Patel
    Ewa Dominowska
    Juro Gottweis
    Joelle Barral
    Kat Chou
    Jake Sunshine
    arXiv (2023)
    An accurate differential diagnosis (DDx) is a cornerstone of medical care, often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by Large Language Models (LLMs) present new opportunities to both assist and automate aspects of this process. In this study, we introduce an LLM optimized for diagnostic reasoning and evaluate its ability to generate a DDx alone or as an aid to clinicians. Twenty clinicians evaluated 302 challenging, real-world medical cases sourced from the New England Journal of Medicine (NEJM) case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: either assistance from search engines and standard medical resources, or LLM assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using the respective assistive tools. Our LLM for DDx exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs. 33.6%; p = 0.04). Comparing the two assisted study arms, the DDx quality score was higher for clinicians assisted by our LLM (top-10 accuracy 51.7%) compared to clinicians without its assistance (36.1%) (McNemar's test: 45.7, p < 0.01) and clinicians with search (44.4%) (4.75, p = 0.03). Further, clinicians assisted by our LLM arrived at more comprehensive differential lists than those without its assistance. Our study suggests that our LLM for DDx has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation of its ability to empower physicians and widen patients' access to specialist-level expertise.
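To make the headline numbers above easier to parse, here is a minimal sketch of top-k accuracy over ranked differential lists and a hand-rolled McNemar chi-square statistic for paired correct/incorrect outcomes; the data structures, names, and pairing scheme are assumptions for illustration, not the study's analysis code.

```python
def top_k_accuracy(ranked_ddx, reference, k=10):
    """Fraction of cases whose reference diagnosis appears in the top-k of the ranked DDx list."""
    hits = sum(ref in ddx[:k] for ddx, ref in zip(ranked_ddx, reference))
    return hits / len(reference)

def mcnemar_chi2(correct_a, correct_b):
    """McNemar chi-square statistic (no continuity correction) for paired binary
    outcomes, e.g., the same cases scored correct/incorrect under two conditions."""
    n10 = sum(x and not y for x, y in zip(correct_a, correct_b))  # A correct, B wrong
    n01 = sum(y and not x for x, y in zip(correct_a, correct_b))  # B correct, A wrong
    return (n10 - n01) ** 2 / (n10 + n01) if (n10 + n01) else 0.0

# Hypothetical toy example: two cases with ranked DDx lists and reference diagnoses.
ddx_lists = [["pericarditis", "myocarditis", "pulmonary embolism"],
             ["sarcoidosis", "tuberculosis", "lymphoma"]]
refs = ["myocarditis", "lymphoma"]
print(top_k_accuracy(ddx_lists, refs, k=3))  # 1.0
```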
    Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
    Real-time diabetic retinopathy screening by deep learning in a multisite national screening programme: a prospective interventional cohort study
    Paisan Raumviboonsuk
    Variya Nganthavee
    Kornwipa Hemarat
    Apinpat Kongprayoon
    Rajiv Raman
    Brian Levinstein
    Roy Lee
    Sunny Virmani
    John Chambers
    Fred Hersch
    Lily Hao Yi Peng
    The Lancet Digital Health (2022)
    Background: Diabetic retinopathy is a leading cause of preventable blindness, especially in low-income and middle-income countries (LMICs). Deep-learning systems have the potential to enhance diabetic retinopathy screenings in these settings, yet prospective studies assessing their usability and performance are scarce.
    Methods: We did a prospective interventional cohort study to evaluate the real-world performance and feasibility of deploying a deep-learning system into the health-care system of Thailand. Patients with diabetes and listed on the national diabetes registry, aged 18 years or older, able to have their fundus photograph taken for at least one eye, and due for screening as per the Thai Ministry of Public Health guidelines were eligible for inclusion. Eligible patients were screened with the deep-learning system at nine primary care sites under Thailand's national diabetic retinopathy screening programme. Patients with a previous diagnosis of diabetic macular oedema, severe non-proliferative diabetic retinopathy, or proliferative diabetic retinopathy; previous laser treatment of the retina or retinal surgery; other non-diabetic retinopathy eye disease requiring referral to an ophthalmologist; or inability to have a fundus photograph taken of both eyes for any reason were excluded. Deep-learning system-based interpretations of patient fundus images and referral recommendations were provided in real time. As a safety mechanism, regional retina specialists over-read each image. Performance of the deep-learning system (accuracy, sensitivity, specificity, positive predictive value [PPV], and negative predictive value [NPV]) was measured against an adjudicated reference standard, provided by fellowship-trained retina specialists. This study is registered with the Thai national clinical trials registry, TCRT20190902002.
    Findings: Between Dec 12, 2018, and March 29, 2020, 7940 patients were screened for inclusion. 7651 (96·3%) patients were eligible for study analysis, and 2412 (31·5%) patients were referred for diabetic retinopathy, diabetic macular oedema, ungradable images, or low visual acuity. For vision-threatening diabetic retinopathy, the deep-learning system had an accuracy of 94·7% (95% CI 93·0–96·2), sensitivity of 91·4% (87·1–95·0), and specificity of 95·4% (94·1–96·7). The retina specialist over-readers had an accuracy of 93·5% (91·7–95·0; p=0·17), a sensitivity of 84·8% (79·4–90·0; p=0·024), and a specificity of 95·5% (94·1–96·7; p=0·98). The PPV for the deep-learning system was 79·2% (95% CI 73·8–84·3) compared with 75·6% (69·8–81·1) for the over-readers. The NPV for the deep-learning system was 95·5% (92·8–97·9) compared with 92·4% (89·3–95·5) for the over-readers.
    Interpretation: A deep-learning system can deliver real-time diabetic retinopathy detection capability similar to retina specialists in community-based screening settings. Socioenvironmental factors and workflows must be taken into consideration when implementing a deep-learning system within a large-scale screening programme in LMICs.
    Funding: Google and Rajavithi Hospital, Bangkok, Thailand.
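For reference, the screening metrics reported above can all be derived from a simple confusion matrix, as in the hedged sketch below; the function and variable names are illustrative assumptions, and the study's bootstrap confidence intervals and adjudication workflow are not reproduced.

```python
def screening_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, PPV, and NPV for binary referable-disease labels.

    y_true / y_pred are parallel sequences of 0/1 flags, where 1 indicates
    vision-threatening disease per the reference standard / the system's call."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
        "npv": tn / (tn + fn) if (tn + fn) else float("nan"),
    }

# Hypothetical example with eight patients.
print(screening_metrics([1, 0, 1, 0, 0, 1, 0, 0], [1, 0, 1, 1, 0, 0, 0, 0]))
```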
    Data Excellence for AI: Why Should You Care
    Matt Lease
    Praveen Kumar Paritosh
    ACM IX Interactions (2022)
    The efficacy of machine learning (ML) models depends on both algorithms and data. Training data defines what we want our models to learn, and testing data provides the means by which empirical progress is measured. Benchmark datasets such as SQuAD, GLUE, and ImageNet define the entire world within which models exist and operate, yet research continues to focus on critiquing and improving the models, e.g., via shared-task challenges or Kaggle contests, rather than critiquing and improving the data environment in which our models operate. Research and community challenges focused on improving the data itself are relatively rare. If "data is the new oil," our use of data remains crude today, and we are missing work on the refineries by which the data itself could be optimized for more effective use. Important scientific opportunities and value are being neglected [Schaekermann et al., 2020].
    Data is potentially the most under-valued and de-glamorised aspect of today's AI ecosystem. Data issues are often perceived and characterized as mundane and rote, the "pre-processing" that has to be done before the real (modeling) work can be done. For example, Kandel et al. (2012) emphasize that ML practitioners view data wrangling as tedious and time-consuming. However, Sambasivan et al. (2021) provide examples of how data quality is crucial to ensure that AI systems can accurately represent and predict the phenomena they claim to measure. They introduce four classes of Data Cascades: compounding events causing negative, downstream effects from data issues, triggered by conventional AI/ML practices that undervalue data quality. This emphasizes the significance of data due to its downstream impact on user wellbeing and societal effects.
    Real-world datasets are often "dirty", with various data quality problems (Northcutt et al., 2021), and carry the risk of "garbage in = garbage out" in terms of the downstream AI systems we train and test on such data. This has inspired a steadily growing body of work on understanding and improving data quality (Chu et al., 2013; Krishnan et al., 2016; Redman et al., 2018; Raman et al., 2001). It also highlights the importance of rigorously managing data quality using mechanisms specific to data validation, instead of relying on model performance as a proxy for data quality (Thomas et al., 2020). Just as we rigorously test our code for software defects before deployment, we might test for data defects with the same degree of rigor, so that we can detect, prevent, or mitigate weaknesses in ML models caused by underlying issues in data quality.
    The "Crowdsourcing Adverse Test Sets for Machine Learning" (CATS4ML) Data Challenge (Aroyo and Paritosh, 2021) aims to raise the bar in ML evaluation sets and to find as many examples as possible that are confusing or otherwise problematic for algorithms to process. Similarly to Vandenhof (2019), CATS4ML relies on people's abilities and intuition to spot new data examples about which machine learning models are confident but which are actually misclassified. This research is inspired by Attenberg et al. (2015), following Ipeirotis's (2016) claim that "Humans should always be part of machine learning solutions, as they can guide machine learning systems to learn about things that the systems don't yet know — the 'unknown unknowns.'"
    Many benchmark datasets contain instances that are relatively easy (e.g., photos with a subject that is easy to identify). In so doing, they miss the natural ambiguity of the real world in which our models are to be actually applied. Data instances with annotator disagreement are often aggregated to eliminate disagreement (obscuring uncertainty), or filtered out of datasets entirely. Excluding difficult and/or ambiguous real-world examples from evaluation risks producing "toy dataset" benchmarks that diverge from the real data encountered in practice. Models that succeed on such benchmarks fail to generalize to real data, and inflated benchmark results mislead our assessment of state-of-the-art capabilities. ML models become prone to developing "weak spots", i.e., classes of examples that are difficult or impossible for a model to accurately evaluate because that class of examples is missing from the evaluation set.
    Measuring data quality is challenging, nebulous, and often circularly defined, with annotated data defining the "ground truth" on which models are trained and tested [Riezler, 2014]. When dataset quality is considered, the ways in which it is measured in practice are often poorly understood and sometimes simply wrong. Challenges identified include fairness and bias issues in labeled datasets [Goel and Faltings, 2019], quality issues in datasets [Crawford and Paglen, 2019], limitations of benchmarks [Kovaleva et al., 2019; Welty et al., 2019], reproducibility concerns in machine learning research [Pineau et al., 2018; Gunderson and Kjensmo, 2018], and a lack of documentation and replication of data [Katsuno et al., 2019].
    Measurement of AI success and progress today is often metrics-driven, with emphasis on rigorous measurement and A/B testing. However, measuring goodness of fit of the model to the dataset completely ignores any consideration of how well the dataset fits the real-world problem to be solved and its data. Goodness-of-fit metrics such as F1, accuracy, and AUC do not tell us much about data fidelity (i.e., how well the dataset represents reality) and validity (how well the data explains things related to the phenomena captured by the data). No standardised metrics exist today for characterising the goodness of data [11,13]. Research on metrics is emerging [15,91] but is not yet widely known, accepted, or applied in the AI ecosystem today. As a result, there is an overreliance on goodness-of-fit metrics and post-deployment product metrics. Focusing on the fidelity and validity of data will further increase its scientific value and reusability. Such research is necessary for enabling better incentives for data, as it is hard to improve something we cannot measure.
    Researchers in human computation (HCOMP) and various ML-related fields have demonstrated a longstanding interest in applying crowdsourcing approaches to generate human-annotated data for model training and testing [25,128]. A series of workshops (Meta-Eval 2020 @ AAAI, REAIS 2019 @ HCOMP, SAD 2019 @ TheWebConf (WWW), SAD 2018 @ HCOMP) have helped increase awareness of data quality issues in ML evaluation and provided a venue for scholarship on this subject. Because human-annotated data represents the compass that the entire ML community relies on, data-focused research, by the HCOMP community and others, can potentially have a multiplicative effect on accelerating progress in ML more broadly. Optimizing the cost, size, and speed of collecting data has attracted significant attention in the first-to-market rush with data. However, the maintainability, reliability, validity, and fidelity of datasets have often been overlooked.
    We argue that we have now reached an inflection point in the field of ML at which attention to this neglected dimension of data quality is poised to significantly accelerate progress. Toward this end, we advocate for research defining and creating processes to achieve data excellence, and we highlight examples, case studies, and methodologies. This will enable the necessary change in our research culture to value excellence in data practices, which is a critical milestone on the road to enabling the next generation of breakthroughs in ML and AI.
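The point above about aggregating away annotator disagreement can be made concrete with a short sketch: rather than collapsing each item to a majority label, one can retain a per-item disagreement score such as the entropy of the label distribution. The data format and example labels below are illustrative assumptions.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of one item's annotator labels; 0 means full agreement."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical items: each maps an item ID to the labels assigned by different annotators.
items = {
    "img_001": ["cat", "cat", "cat"],       # clear case
    "img_002": ["cat", "lynx", "cat"],      # some disagreement
    "img_003": ["cat", "lynx", "bobcat"],   # genuinely ambiguous
}

for item_id, labels in items.items():
    majority = Counter(labels).most_common(1)[0][0]
    # Rather than dropping high-disagreement items, flag them for review or
    # evaluate models against the full label distribution.
    print(f"{item_id}: majority={majority}, disagreement={label_entropy(labels):.2f} bits")
```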