Jump to Content
Mike Schaekermann

Mike Schaekermann

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Towards Generalist Biomedical AI
    Danny Driess
    Andrew Carroll
    Chuck Lau
    Ryutaro Tanno
    Ira Ktena
    Anil Palepu
    Basil Mustafa
    Simon Kornblith
    Philip Mansfield
    Sushant Prakash
    Renee Wong
    Sunny Virmani
    Sara Mahdavi
    Bradley Green
    Ewa Dominowska
    Joelle Barral
    Pete Florence
    NEJM AI (2024)
    Preview abstract BACKGROUND: Medicine is inherently multimodal, requiring the simultaneous interpretation and integration of insights between many data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence systems that flexibly encode, integrate, and interpret these data might better enable impactful applications ranging from scientific discovery to care delivery. METHODS: To catalyze development of these models, we curated MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks, such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduced Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. To further probe the capabilities and limitations of Med-PaLM M, we conducted a radiologist evaluation of model-generated (and human) chest x-ray reports. RESULTS: We observed encouraging performance across model scales. Med-PaLM M reached performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. In a side-by-side ranking on 246 retrospective chest x-rays, clinicians expressed a pairwise preference for Med-PaLM Multimodal reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility. CONCLUSIONS: Although considerable work is needed to validate these models in real-world cases and understand if cross-modality generalization is possible, our results represent a milestone toward the development of generalist biomedical artificial intelligence systems. View details
    Towards Conversational Diagnostic AI
    Anil Palepu
    Khaled Saab
    Jan Freyberg
    Ryutaro Tanno
    Amy Wang
    Brenna Li
    Nenad Tomašev
    Le Hou
    Albert Webson
    Kavita Kulkarni
    Sara Mahdavi
    Juro Gottweis
    Joelle Barral
    Kat Chou
    Arxiv (2024) (to appear)
    Preview abstract At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. Artificial Intelligence (AI) systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue. AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts. We designed a framework for evaluating clinically-meaningful axes of performance including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE). The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text-chat which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI. View details
    Health equity assessment of machine learning performance (HEAL): a framework and dermatology AI model case study
    Terry Spitz
    Malcolm Chelliah
    Heather Cole-Lewis
    Stephanie Farquhar
    Qinghan Xue
    Jenna Lester
    Cían Hughes
    Patricia Strachan
    Fraser Tan
    Peggy Bui
    Craig Mermel
    Lily Peng
    Sunny Virmani
    Ivor Horn
    Cameron Chen
    The Lancet eClinicalMedicine (2024)
    Preview abstract Background Artificial intelligence (AI) has repeatedly been shown to encode historical inequities in healthcare. We aimed to develop a framework to quantitatively assess the performance equity of health AI technologies and to illustrate its utility via a case study. Methods Here, we propose a methodology to assess whether health AI technologies prioritise performance for patient populations experiencing worse outcomes, that is complementary to existing fairness metrics. We developed the Health Equity Assessment of machine Learning performance (HEAL) framework designed to quantitatively assess the performance equity of health AI technologies via a four-step interdisciplinary process to understand and quantify domain-specific criteria, and the resulting HEAL metric. As an illustrative case study (analysis conducted between October 2022 and January 2023), we applied the HEAL framework to a dermatology AI model. A set of 5420 teledermatology cases (store-and-forward cases from patients of 20 years or older, submitted from primary care providers in the USA and skin cancer clinics in Australia), enriched for diversity in age, sex and race/ethnicity, was used to retrospectively evaluate the AI model's HEAL metric, defined as the likelihood that the AI model performs better for subpopulations with worse average health outcomes as compared to others. The likelihood that AI performance was anticorrelated to pre-existing health outcomes was estimated using bootstrap methods as the probability that the negated Spearman's rank correlation coefficient (i.e., “R”) was greater than zero. Positive values of R suggest that subpopulations with poorer health outcomes have better AI model performance. Thus, the HEAL metric, defined as p (R >0), measures how likely the AI technology is to prioritise performance for subpopulations with worse average health outcomes as compared to others (presented as a percentage below). Health outcomes were quantified as disability-adjusted life years (DALYs) when grouping by sex and age, and years of life lost (YLLs) when grouping by race/ethnicity. AI performance was measured as top-3 agreement with the reference diagnosis from a panel of 3 dermatologists per case. Findings Across all dermatologic conditions, the HEAL metric was 80.5% for prioritizing AI performance of racial/ethnic subpopulations based on YLLs, and 92.1% and 0.0% respectively for prioritizing AI performance of sex and age subpopulations based on DALYs. Certain dermatologic conditions were significantly associated with greater AI model performance compared to a reference category of less common conditions. For skin cancer conditions, the HEAL metric was 73.8% for prioritizing AI performance of age subpopulations based on DALYs. Interpretation Analysis using the proposed HEAL framework showed that the dermatology AI model prioritised performance for race/ethnicity, sex (all conditions) and age (cancer conditions) subpopulations with respect to pre-existing health disparities. More work is needed to investigate ways of promoting equitable AI performance across age for non-cancer conditions and to better understand how AI models can contribute towards improving equity in health outcomes. View details
    Preview abstract Large language models (LLMs) hold immense promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. In this work, we present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and then conduct an empirical case study with Med-PaLM 2, resulting in the largest human evaluation study in this area to date. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases, and EquityMedQA, a collection of seven newly-released datasets comprising both manually-curated and LLM-generated questions enriched for adversarial queries. Both our human assessment framework and dataset design process are grounded in an iterative participatory approach and review of possible biases in Med-PaLM 2 answers to adversarial queries. Through our empirical study, we find that the use of a collection of datasets curated through a variety of methodologies, coupled with a thorough evaluation protocol that leverages multiple assessment rubric designs and diverse rater groups, surfaces biases that may be missed via narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. We emphasize that while our framework can identify specific forms of bias, it is not sufficient to holistically assess whether the deployment of an AI system promotes equitable health outcomes. We hope the broader community leverages and builds on these tools and methods towards realizing a shared goal of LLMs that promote accessible and equitable healthcare for all. View details
    Preview abstract Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering. View details
    Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report generation
    Ryutaro Tanno
    David Barrett
    Sumedh Ghaisas
    Sumanth Dathathri
    Abi See
    Johannes Welbl
    Rhys May
    Roy Lee
    SiWai Man
    Zahra Ahmed
    Sara Mahdavi
    Joelle Barral
    Ali Eslami
    Danielle Belgrave
    Shravya Shetty
    Po-Sen Huang
    Ira Ktena
    Arxiv (2023)
    Preview abstract Radiology reports are an instrumental part of modern medicine, informing key clinical decisions such as diagnosis and treatment. The worldwide shortage of radiologists, however, restricts access to expert care and imposes heavy workloads, contributing to avoidable errors and delays in report delivery. While recent progress in automated report generation with vision-language models offer clear potential in ameliorating the situation, the path to real-world adoption has been stymied by the challenge of evaluating the clinical quality of AI-generated reports. In this study, we build a state-of-the-art report generation system for chest radiographs, Flamingo-CXR, by fine-tuning a well-known vision-language foundation model on radiology data. To evaluate the quality of the AI-generated reports, a group of 16 certified radiologists provide detailed evaluations of AI-generated and human written reports for chest X-rays from an intensive care setting in the United States and an inpatient setting in India. At least one radiologist (out of two per case) preferred the AI report to the ground truth report in over 60% of cases for both datasets. Amongst the subset of AI-generated reports that contain errors, the most frequently cited reasons were related to the location and finding, whereas for human written reports, most mistakes were related to severity and finding. This disparity suggested potential complementarity between our AI system and human experts, prompting us to develop an assistive scenario in which Flamingo-CXR generates a first-draft report, which is subsequently revised by a clinician. This is the first demonstration of clinician-AI collaboration for report writing, and the resultant reports are assessed to be equivalent or preferred by at least one radiologist to reports written by experts alone in 80% of in-patient cases and 60% of intensive care cases. View details
    Towards Accurate Differential Diagnosis with Large Language Models
    Daniel McDuff
    Anil Palepu
    Amy Wang
    Yash Sharma
    Kavita Kulkarni
    Le Hou
    Sara Mahdavi
    Sushant Prakash
    Anupam Pathak
    Shwetak Patel
    Ewa Dominowska
    Juro Gottweis
    Joelle Barral
    Kat Chou
    Jake Sunshine
    Arxiv (2023)
    Preview abstract An accurate differential diagnosis (DDx) is a cornerstone of medical care, often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by Large Language Models (LLMs) present new opportunities to both assist and automate aspects of this process. In this study, we introduce an LLM optimized for diagnostic reasoning, and evaluate its ability to generate a DDx alone or as an aid to clinicians. 20 clinicians evaluated 302 challenging, real-world medical cases sourced from the New England Journal of Medicine (NEJM) case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: either assistance from search engines and standard medical resources, or LLM assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using the respective assistive tools. Our LLM for DDx exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs 33.6%, [p = 0.04]). Comparing the two assisted study arms, the DDx quality score was higher for clinicians assisted by our LLM (top-10 accuracy 51.7%) compared to clinicians without its assistance (36.1%) (McNemar's Test: 45.7, p < 0.01) and clinicians with search (44.4%) (4.75, p = 0.03). Further, clinicians assisted by our LLM arrived at more comprehensive differential lists than those without its assistance. Our study suggests that our LLM for DDx has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients' access to specialist-level expertise. View details
    Data Excellence for AI: Why Should You Care
    Matt Lease
    Praveen Kumar Paritosh
    ACM IX Interactions (2022)
    Preview abstract The efficacy of machine learning (ML) models depends on both algorithms and data. Training data defines what we want our models to learn, and testing data provides the means by which empirical progress is measured. Benchmark datasets such as SQuAD, GLUE, and ImageNet define the entire world within which models exist and operate, yet research continues to focus on critiquing and improving the models, e.g., via shared-task challenges or Kaggle contests, rather than critiquing and improving the data environment in which our models operate. Research and community challenges focused on improving the data itself are relatively rare. If “data is the new oil,” our use of data remains crude today, and we are missing work on the refineries by which the data itself could be optimized for more effective use. Important scientific opportunities and value are being neglected [Schaekermann et al., 2020]. Data is potentially the most under-valued and de-glamorised aspect of today’s AI ecosystem. Data issues are often perceived and characterized as mundane and rote, the “pre-processing” that has to be done before the real (modeling) work can be done. For example, Kandel et al. (2012) emphasize that ML practitioners view data wrangling as tedious and time-consuming. However, Sambasivan et al. (2021) provide examples of how data quality is crucial to ensure that AI systems can accurately represent and predict the phenomenon it is claiming to measure. They introduce four classes of Data Cascades: compounding events causing negative, downstream effects from data issues triggered by conventional AI/ML practices that undervalue data quality. This emphasizes the significance of data due to its downstream impact on user wellbeing and societal effects. Real-world datasets are often ‘dirty’, with various data quality problems (Northcutt et al, 2021), with the risk of “garbage in = garbage out” in terms of the downstream AI systems we train and test on such data. This has inspired a steadily growing body of work on understanding and improving data quality (Chu, et al, 2013; Krishnan, et al, 2016; Redman, et al, 2018; Raman et al, 2001). It also highlights the importance of rigorously managing data quality using mechanisms specific to data validation, instead of relying on model performance as a proxy for data quality (Thomas, et al, 2020). Just as we rigorously test our code for software defects before deployment, we might test for data defects with the same degree of rigor, so that we might detect, prevent, or mitigate weaknesses in ML models caused by underlying issues in data quality. The “Crowdsourcing Adverse Test Sets for Machine Learning (CATS4ML)” Data Challenge (Aroyo and Paritosh, 2021) aims to raise the bar in ML evaluation sets and to find as many examples as possible that are confusing or otherwise problematic for algorithms to process. Similarly to (Vandenhof, 2019) CATS4ML relies on people’s abilities and intuition to spot new data examples about which machine learning is confident, but actually misclassified. This research is inspired by (Attenberg et al, 2015) following the claim “Humans should always be part of machine learning solutions, as they can guide machine learning systems to learn about things that the systems don't yet know — the “unknown unknowns.”” by Iperiotis, (2016). Many benchmark datasets contain instances that are relatively easy (e.g., photos with a subject that is easy to identify). In so doing, they miss the natural ambiguity of the real world in which our models are to be actually applied. Data instances with annotator disagreement are often aggregated to eliminate disagreement (obscuring uncertainty), or filtered out of datasets entirely. Exclusion of difficult and/or ambiguous real-world examples in evaluation risks “toy dataset” benchmarks that diverge from the real data to be encountered in practice. Successful benchmark models fail to generalize to real data, and inflated benchmark results mislead our assessment of state-of-the-art capabilities. ML models become prone to develop “weak spots”, i.e., classes of examples that are difficult or impossible for a model to accurately evaluate, because that class of examples is missing from the evaluation set. Measuring data quality is challenging, nebulous, and often circularly defined, with annotated data defining the “ground truth” on which models are trained and tested [Riezler, 2014]. When dataset quality is considered, the ways in which it is measured in practice is often poorly understood and sometimes simply wrong. Challenges identified include fairness and bias issues in labeled datasets [Goel and Faltings, 2019], quality issues in datasets [Crawford and Paglen, 2019], limitations of benchmarks [Kovaleva et al., 2019, Welty et al., 2019] reproducibility concerns in machine learning research [Pineau et al., 2018, Gunderson and Kjensmo, 2018], lack of documentation and replication of data [Katsuno et al., 2019]. Measurement of AI success and progress today is often metrics-driven, with emphasis on rigorous measurement and A/B testing. However, measuring goodness of the fit of the model to the dataset completely ignores any consideration of how well the dataset fits the real world problem to be solved and its data. Goodness-of-fit metrics, such as F1, Accuracy, AUC, do not tell us much about data fidelity (i.e., how well the dataset represents reality) and validity (how well the data explains things related to the phenomena captured by the data). No standardised metrics exist today for characterising the goodness-of-data [11,13]. Research on metrics is emerging [15,91] but is not yet widely known, accepted, or applied in the AI ecosystem today. As a result, there is an overreliance on goodness-of-fit metrics and post-deployment product metrics. Focusing on fidelity and validity of data will further increase its scientific value and reusability. Such research is necessary for enabling better incentives for data, as it is hard to improve something we can not measure. Researchers in human computation (HCOMP) and various ML-related fields have demonstrated a longstanding interest in applying crowdsourcing approaches to generate human-annotated data for model training and testing [25,128]. A series of workshops (Meta-Eval 2020 @ AAAI, REAIS 2019 @ HCOMP, SAD 2019 @ TheWebConf (WWW), SAD 2018 @ HCOMP) have helped increase further awareness about the issues of data quality for ML evaluation and provide a venue for scholarship on this subject. Because human annotated data represents the compass that the entire ML community relies on, data-focused research, by the HCOMP community and others, can potentially have a multiplicative effect on accelerating progress in ML more broadly. Optimizing the cost, size, and speed of collecting data has attracted significant attention in the first-to-market rush with data. However, aspects of maintainability, reliability, validity, and fidelity of datasets have been often overlooked. We argue we have now reached an inflection point in the field of ML in which attention to neglected data quality is poised to significantly accelerate progress. Toward this end, we advocate for research defining and creating processes to achieve data excellence. We highlight examples, case-studies, and methodologies. This will enable the necessary change in our research culture to value excellence in data practices, which is a critical milestone on the road to enabling the next generation of breakthroughs in ML and AI. View details
    Real-time diabetic retinopathy screening by deep learning in a multisite national screening programme: a prospective interventional cohort study
    Dr. Paisan Raumviboonsuk
    Variya Nganthavee
    Kornwipa Hemarat
    Apinpat Kongprayoon
    Rajiv Raman
    Brian Levinstein
    Roy Lee
    Sunny Virmani
    John Chambers
    Fred Hersch
    Lily Hao Yi Peng
    The Lancet Digital Health (2022)
    Preview abstract Background: Diabetic retinopathy is a leading cause of preventable blindness, especially in low-income and middle-income countries (LMICs). Deep-learning systems have the potential to enhance diabetic retinopathy screenings in these settings, yet prospective studies assessing their usability and performance are scarce. Methods: We did a prospective interventional cohort study to evaluate the real-world performance and feasibility of deploying a deep-learning system into the health-care system of Thailand. Patients with diabetes and listed on the national diabetes registry, aged 18 years or older, able to have their fundus photograph taken for at least one eye, and due for screening as per the Thai Ministry of Public Health guidelines were eligible for inclusion. Eligible patients were screened with the deep-learning system at nine primary care sites under Thailand's national diabetic retinopathy screening programme. Patients with a previous diagnosis of diabetic macular oedema, severe non-proliferative diabetic retinopathy, or proliferative diabetic retinopathy; previous laser treatment of the retina or retinal surgery; other non-diabetic retinopathy eye disease requiring referral to an ophthalmologist; or inability to have fundus photograph taken of both eyes for any reason were excluded. Deep-learning system-based interpretations of patient fundus images and referral recommendations were provided in real time. As a safety mechanism, regional retina specialists over-read each image. Performance of the deep-learning system (accuracy, sensitivity, specificity, positive predictive value [PPV], and negative predictive value [NPV]) were measured against an adjudicated reference standard, provided by fellowship-trained retina specialists. This study is registered with the Thai national clinical trials registry, TCRT20190902002. Findings: Between Dec 12, 2018, and March 29, 2020, 7940 patients were screened for inclusion. 7651 (96·3%) patients were eligible for study analysis, and 2412 (31·5%) patients were referred for diabetic retinopathy, diabetic macular oedema, ungradable images, or low visual acuity. For vision-threatening diabetic retinopathy, the deep-learning system had an accuracy of 94·7% (95% CI 93·0–96·2), sensitivity of 91·4% (87·1–95·0), and specificity of 95·4% (94·1–96·7). The retina specialist over-readers had an accuracy of 93·5 (91·7–95·0; p=0·17), a sensitivity of 84·8% (79·4–90·0; p=0·024), and specificity of 95·5% (94·1–96·7; p=0·98). The PPV for the deep-learning system was 79·2 (95% CI 73·8–84·3) compared with 75·6 (69·8–81·1) for the over-readers. The NPV for the deep-learning system was 95·5 (92·8–97·9) compared with 92·4 (89·3–95·5) for the over-readers. Interpretation: A deep-learning system can deliver real-time diabetic retinopathy detection capability similar to retina specialists in community-based screening settings. Socioenvironmental factors and workflows must be taken into consideration when implementing a deep-learning system within a large-scale screening programme in LMICs. Funding: Google and Rajavithi Hospital, Bangkok, Thailand. View details
    Longitudinal Screening for Diabetic Retinopathy in a Nationwide Screening Program: Comparing Deep Learning and Human Graders
    Jirawut Limwattanayingyong
    Variya Nganthavee
    Kasem Seresirikachorn
    Tassapol Singalavanija
    Ngamphol Soonthornworasiri
    Varis Ruamviboonsuk
    Chetan Rao
    Rajiv Raman
    Andrzej Grzybowski
    Lily Hao Yi Peng
    Fred Hersch
    Richa Tiwari, PhD
    Dr. Paisan Raumviboonsuk
    Journal of Diabetes Research (2020)
    Preview abstract Objective. To evaluate diabetic retinopathy (DR) screening via deep learning (DL) and trained human graders (HG) in a longitudinal cohort, as case spectrum shifts based on treatment referral and new-onset DR. Methods. We randomly selected patients with diabetes screened twice, two years apart within a nationwide screening program. The reference standard was established via adjudication by retina specialists. Each patient’s color fundus photographs were graded, and a patient was considered as having sight-threatening DR (STDR) if the worse eye had severe nonproliferative DR, proliferative DR, or diabetic macular edema. We compared DR screening via two modalities: DL and HG. For each modality, we simulated treatment referral by excluding patients with detected STDR from the second screening using that modality. Results. There were 5,738 patients (12.3% STDR) in the first screening. DL and HG captured different numbers of STDR cases, and after simulated referral and excluding ungradable cases, 4,148 and 4,263 patients remained in the second screening, respectively. The STDR prevalence at the second screening was 5.1% and 6.8% for DL- and HG-based screening, respectively. Along with the prevalence decrease, the sensitivity for both modalities decreased from the first to the second screening (DL: from 95% to 90%, p=0.008; HG: from 74% to 57%, p<0.001). At both the first and second screenings, the rate of false negatives for the DL was a fifth that of HG (0.5-0.6% vs. 2.9-3.2%). Conclusion. On 2-year longitudinal follow-up of a DR screening cohort, STDR prevalence decreased for both DL- and HG-based screening. Follow-up screenings in longitudinal DR screening can be more difficult and induce lower sensitivity for both DL and HG, though the false negative rate was substantially lower for DL. Our data may be useful for health-economics analyses of longitudinal screening settings. View details
    Expert Discussions Improve Comprehension of Difficult Cases in Medical Image Assessment
    Abigail E. Huang
    ACM CHI Conference on Human Factors in Computing Systems (CHI 2020) (2020) (to appear)
    Preview abstract Medical data labeling workflows critically depend on accurate assessments from human experts. Yet human assessments can vary markedly, even among medical experts. Prior research has demonstrated benefits of labeler training on performance. Here we utilized two types of labeler training feedback: highlighting incorrect labels for difficult cases ("individual performance" feedback), and expert discussions from adjudication of these cases. We presented ten non-specialist eye care professionals with either individual performance alone, or individual performance and expert discussions. Compared to performance feedback alone, seeing expert discussions significantly improved non-specialists' understanding of the rationale behind the correct diagnosis while motivating changes in their own labeling approach; and also significantly improved average accuracy on one of four pathologies in a held-out test set. This work suggests that image adjudication may provide benefits beyond developing trusted consensus labels, and that exposure to specialist discussions can be an effective training intervention for medical diagnosis. View details
    Preview abstract Purpose To develop and validate a deep learning (DL) algorithm that predicts referable glaucomatous optic neuropathy (GON) and optic nerve head (ONH) features from color fundus images, to determine the relative importance of these features in referral decisions by glaucoma specialists (GSs) and the algorithm, and to compare the performance of the algorithm with eye care providers. Design Development and validation of an algorithm. Participants Fundus images from screening programs, studies, and a glaucoma clinic. Methods A DL algorithm was trained using a retrospective dataset of 86 618 images, assessed for glaucomatous ONH features and referable GON (defined as ONH appearance worrisome enough to justify referral for comprehensive examination) by 43 graders. The algorithm was validated using 3 datasets: dataset A (1205 images, 1 image/patient; 18.1% referable), images adjudicated by panels of GSs; dataset B (9642 images, 1 image/patient; 9.2% referable), images from a diabetic teleretinal screening program; and dataset C (346 images, 1 image/patient; 81.7% referable), images from a glaucoma clinic. Main Outcome Measures The algorithm was evaluated using the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity for referable GON and glaucomatous ONH features. Results The algorithm’s AUC for referable GON was 0.945 (95% confidence interval [CI], 0.929–0.960) in dataset A, 0.855 (95% CI, 0.841–0.870) in dataset B, and 0.881 (95% CI, 0.838–0.918) in dataset C. Algorithm AUCs ranged between 0.661 and 0.973 for glaucomatous ONH features. The algorithm showed significantly higher sensitivity than 7 of 10 graders not involved in determining the reference standard, including 2 of 3 GSs, and showed higher specificity than 3 graders (including 1 GS), while remaining comparable to others. For both GSs and the algorithm, the most crucial features related to referable GON were: presence of vertical cup-to-disc ratio of 0.7 or more, neuroretinal rim notching, retinal nerve fiber layer defect, and bared circumlinear vessels. Conclusions A DL algorithm trained on fundus images alone can detect referable GON with higher sensitivity than and comparable specificity to eye care providers. The algorithm maintained good performance on an independent dataset with diagnoses based on a full glaucoma workup. View details
    Preview abstract Purpose: To present and evaluate a remote, tool-based system and structured grading rubric for adjudicating image-based diabetic retinopathy (DR) grades. Methods: We compared three different procedures for adjudicating DR severity assessments among retina specialist panels, including (1) in-person adjudication based on a previously described procedure (Baseline), (2) remote, tool-based adjudication for assessing DR severity alone (TA), and (3) remote, tool-based adjudication using a feature-based rubric (TA-F). We developed a system allowing graders to review images remotely and asynchronously. For both TA and TA-F approaches, images with disagreement were reviewed by all graders in a round-robin fashion until disagreements were resolved. Five panels of three retina specialists each adjudicated a set of 499 retinal fundus images (1 panel using Baseline, 2 using TA, and 2 using TA-F adjudication). Reliability was measured as grade agreement among the panels using Cohen's quadratically weighted kappa. Efficiency was measured as the number of rounds needed to reach a consensus for tool-based adjudication. Results: The grades from remote, tool-based adjudication showed high agreement with the Baseline procedure, with Cohen's kappa scores of 0.948 and 0.943 for the two TA panels, and 0.921 and 0.963 for the two TA-F panels. Cases adjudicated using TA-F were resolved in fewer rounds compared with TA (P < 0.001; standard permutation test). Conclusions: Remote, tool-based adjudication presents a flexible and reliable alternative to in-person adjudication for DR diagnosis. Feature-based rubrics can help accelerate consensus for tool-based adjudication of DR without compromising label quality. Translational Relevance: This approach can generate reference standards to validate automated methods, and resolve ambiguous diagnoses by integrating into existing telemedical workflows. View details
    No Results Found