Dale Webster
Dale Webster is Director of Research at Google Health, working to improve patient outcomes in healthcare using deep learning and medical imaging. His recent work applies AI to screen for diabetic retinopathy in India and Thailand, predict cardiovascular health factors from fundus photographs, provide differential diagnoses of skin disease, and build applications of medically tuned LLMs. Prior to Google, he was a Software Engineer at Pacific Biosciences, working on direct sequencing of methylation state and on rapid sequencing and assembly of microbial pathogens during global outbreaks. His PhD work in Bioinformatics at the University of California, San Francisco focused on viral evolution, and he received his Bachelor of Science in Computer Science from Rice University.
Authored Publications
Abstract: Advances in machine learning for health care have brought concerns about bias from the research community; specifically, the introduction, perpetuation, or exacerbation of care disparities. Reinforcing these concerns is the finding that medical images often reveal signals about sensitive attributes in ways that are hard to pinpoint by both algorithms and people. This finding raises a question about how best to design general-purpose pretrained embeddings (GPPEs, defined as embeddings meant to support a broad array of use cases) for building downstream models that are free from particular types of bias. The downstream model should be carefully evaluated for bias, and audited and improved as appropriate. However, in our view, well-intentioned attempts to prevent the upstream components—GPPEs—from learning sensitive attributes can have unintended consequences on the downstream models. Despite producing a veneer of technical neutrality, the resultant end-to-end system might still be biased or poorly performing. We present reasons, by building on previously published data, to support the reasoning that GPPEs should ideally contain as much information as the original data contain, and highlight the perils of trying to remove sensitive attributes from a GPPE. We also emphasise that downstream prediction models trained for specific tasks and settings, whether developed using GPPEs or not, should be carefully designed and evaluated to avoid bias that makes models vulnerable to issues such as distributional shift. These evaluations should be done by a diverse team, including social scientists, on a diverse cohort representing the full breadth of the patient population for which the final model is intended.
    Towards Generalist Biomedical AI
    Danny Driess
    Andrew Carroll
    Chuck Lau
    Ryutaro Tanno
    Ira Ktena
    Anil Palepu
    Basil Mustafa
    Simon Kornblith
    Philip Mansfield
    Sushant Prakash
    Renee Wong
    Sunny Virmani
    Sara Mahdavi
    Bradley Green
    Ewa Dominowska
    Joelle Barral
    NEJM AI (2024)
Abstract: BACKGROUND: Medicine is inherently multimodal, requiring the simultaneous interpretation and integration of insights between many data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence systems that flexibly encode, integrate, and interpret these data might better enable impactful applications ranging from scientific discovery to care delivery. METHODS: To catalyze development of these models, we curated MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks, such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduced Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. To further probe the capabilities and limitations of Med-PaLM M, we conducted a radiologist evaluation of model-generated (and human) chest x-ray reports. RESULTS: We observed encouraging performance across model scales. Med-PaLM M reached performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. In a side-by-side ranking on 246 retrospective chest x-rays, clinicians expressed a pairwise preference for Med-PaLM Multimodal reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility. CONCLUSIONS: Although considerable work is needed to validate these models in real-world cases and understand if cross-modality generalization is possible, our results represent a milestone toward the development of generalist biomedical artificial intelligence systems. (Funded by Alphabet Inc. and/or a subsidiary thereof.)
    Differences between Patient and Clinician Submitted Images: Implications for Virtual Care of Skin Conditions
    Grace Eunhae Hong
    Margaret Ann Smith
    Aaron Loh
    Vijaytha Muralidharan
    Doris Wong
    Michelle Phung
    Nicolas Betancourt
    Bradley Fong
    Rachna Sahasrabudhe
    Khoban Nasim
    Alec Eschholz
    Kat Chou
    Peggy Bui
    Justin Ko
    Steven Lin
    Mayo Clinic Proceedings: Digital Health (2024)
Abstract: Objective: To understand and highlight the differences in clinical, demographic, and image quality characteristics between patient-taken (PAT) and clinic-taken (CLIN) photographs of skin conditions. Patients and Methods: This retrospective study applied logistic regression to data from 2500 deidentified cases in Stanford Health Care’s eConsult system, from November 2015 to January 2021. Cases with undiagnosable or multiple conditions or cases with both patient and clinician image sources were excluded, leaving 628 PAT cases and 1719 CLIN cases. Demographic factors, such as age and sex, were self-reported, whereas anatomic location, estimated skin type, clinical signs and symptoms, condition duration, and condition frequency were summarized from patient health records. Image quality variables, such as blur, lighting issues, and whether the image contained skin, hair, or nails, were estimated through a deep learning model. Results: Factors that were positively associated with CLIN photographs post-2020 were as follows: age 60 years or older, darker skin types (eFST V/VI), and presence of skin growths. By contrast, factors that were positively associated with PAT photographs included conditions appearing intermittently, cases with blurry photographs, photographs with substantial nonskin (or nail/hair) regions, and cases with more than 3 photographs. Within the PAT cohort, older age was associated with blurry photographs. Conclusion: There are various demographic, clinical, and image quality characteristic differences between PAT and CLIN photographs of skin concerns. The demographic differences present important considerations for improving digital literacy or access, whereas the image quality differences point to the need for improved patient education and better image capture workflows, particularly among elderly patients.
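The analysis above is, at its core, a logistic regression of image source on case-level covariates. A minimal sketch of that style of analysis on synthetic data (variable names are hypothetical, not the study's):

```python
# Logistic regression relating case factors to photograph source
# (1 = clinic-taken, 0 = patient-taken). Synthetic data for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
age_60_plus = rng.integers(0, 2, n)
blurry_photo = rng.integers(0, 2, n)
# Simulate: older patients more likely CLIN, blurry photos more likely PAT.
logit = -0.3 + 0.8 * age_60_plus - 0.9 * blurry_photo
is_clinic_taken = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(np.column_stack([age_60_plus, blurry_photo]))
result = sm.Logit(is_clinic_taken, X).fit(disp=0)
print(np.exp(result.params))  # odds ratios: >1 means associated with CLIN
```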
Abstract: Task-specific deep learning models in histopathology offer promising opportunities for improving diagnosis, clinical research, and precision medicine. However, development of such models is often limited by availability of high-quality data. Foundation models in histopathology that learn general representations across a wide range of tissue types, diagnoses, and magnifications offer the potential to reduce the data, compute, and technical expertise necessary to develop task-specific deep learning models with the required level of model performance. In this work, we describe the development and evaluation of foundation models for histopathology via self-supervised learning (SSL). We first establish a diverse set of benchmark tasks involving 17 unique tissue types and 12 unique cancer types and spanning different optimal magnifications and task types. Next, we use this benchmark to explore and evaluate histopathology-specific SSL methods followed by further evaluation on held-out patch-level and weakly supervised tasks. We found that standard SSL methods thoughtfully applied to histopathology images are performant across our benchmark tasks and that domain-specific methodological improvements can further increase performance. Our findings reinforce the value of using domain-specific SSL methods in pathology, and establish a set of high-quality foundation models to enable further research across diverse applications.
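Self-supervised pretraining of the kind described here is often built around a contrastive objective. Below is a generic SimCLR-style NT-Xent loss in PyTorch; it is a sketch of the general technique, not the paper's implementation:

```python
# Generic NT-Xent (normalized temperature-scaled cross entropy) loss used in
# SimCLR-style contrastive self-supervised learning. Illustrative sketch.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """z1, z2: (N, D) embeddings of two augmented views of the same patches."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit norm
    sim = z @ z.t() / temperature                        # scaled cosine sims
    n = z1.shape[0]
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    # The positive for row i is the other augmented view of the same patch.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)  # stand-in embeddings
print(nt_xent_loss(z1, z2).item())
```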
    Pathologist Validation of a Machine Learning–Derived Feature for Colon Cancer Risk Stratification
    Vincenzo L’Imperio
    Markus Plass
    Heimo Müller
    Nicolò Tamini
    Luca Gianotti
    Nicola Zucchini
    Robert Reihs
    Lily Peng
    Cameron Chen
    Marialuisa Lavitrano
    David F. Steiner
    Kurt Zatloukal
    Fabio Pagni
    JAMA Network Open (2023)
Abstract: Importance: Identifying new prognostic features in colon cancer has the potential to refine histopathologic review and inform patient care. Although prognostic artificial intelligence systems have recently demonstrated significant risk stratification for several cancer types, studies have not yet shown that the machine learning–derived features associated with these prognostic artificial intelligence systems are both interpretable and usable by pathologists. Objective: To evaluate whether pathologist scoring of a histopathologic feature previously identified by machine learning is associated with survival among patients with colon cancer. Design, Setting, and Participants: This prognostic study used deidentified, archived colorectal cancer cases from January 2013 to December 2015 from the University of Milano-Bicocca. All available histologic slides from 258 consecutive colon adenocarcinoma cases were reviewed from December 2021 to February 2022 by 2 pathologists, who conducted semiquantitative scoring for tumor adipose feature (TAF), which was previously identified via a prognostic deep learning model developed with an independent colorectal cancer cohort. Main Outcomes and Measures: Prognostic value of TAF for overall survival and disease-specific survival as measured by univariable and multivariable regression analyses. Interpathologist agreement in TAF scoring was also evaluated. Results: A total of 258 colon adenocarcinoma histopathologic cases from 258 patients (138 men [53%]; median age, 67 years [IQR, 65-81 years]) with stage II (n = 119) or stage III (n = 139) cancer were included. Tumor adipose feature was identified in 120 cases (widespread in 63 cases, multifocal in 31, and unifocal in 26). For overall survival analysis after adjustment for tumor stage, TAF was independently prognostic in 2 ways: TAF as a binary feature (presence vs absence: hazard ratio [HR] for presence of TAF, 1.55 [95% CI, 1.07-2.25]; P = .02) and TAF as a semiquantitative categorical feature (HR for widespread TAF, 1.87 [95% CI, 1.23-2.85]; P = .004). Interpathologist agreement for widespread TAF vs lower categories (absent, unifocal, or multifocal) was 90%, corresponding to a κ metric at this threshold of 0.69 (95% CI, 0.58-0.80). Conclusions and Relevance: In this prognostic study, pathologists were able to learn and reproducibly score for TAF, providing significant risk stratification on this independent data set. Although additional work is warranted to understand the biological significance of this feature and to establish broadly reproducible TAF scoring, this work represents the first validation to date of human expert learning from machine learning in pathology. Specifically, this validation demonstrates that a computationally identified histologic feature can represent a human-identifiable, prognostic feature with the potential for integration into pathology practice.
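Two statistical tools named in this abstract, hazard ratios from a stage-adjusted survival model and quadratic-weighted kappa for interpathologist agreement, are simple to illustrate. A sketch on synthetic data, assuming the lifelines and scikit-learn packages (not the study's code):

```python
# Cox proportional-hazards model (hazard ratios) and quadratic-weighted
# kappa, the two analyses described above. Synthetic data for illustration.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)
n = 258
df = pd.DataFrame({
    "months": rng.exponential(60, n),        # follow-up time
    "died": rng.integers(0, 2, n),           # event indicator
    "taf_present": rng.integers(0, 2, n),    # binary TAF feature
    "stage_iii": rng.integers(0, 2, n),      # tumor-stage adjustment
})
cph = CoxPHFitter().fit(df, duration_col="months", event_col="died")
print(cph.summary[["exp(coef)", "p"]])       # exp(coef) is the hazard ratio

# Agreement between two pathologists on an ordinal 0-3 TAF score.
scores_a = rng.integers(0, 4, 100)
scores_b = np.clip(scores_a + rng.integers(-1, 2, 100), 0, 3)
print(cohen_kappa_score(scores_a, scores_b, weights="quadratic"))
```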
    Discovering novel systemic biomarkers in external eye photos
    Ilana Traynis
    Christina Chen
    Akib Uddin
    Jorge Cuadros
    Lauren P. Daskivich
    April Y. Maa
    Ramasamy Kim
    Eugene Yu-Chuan Kang
    Lily Peng
    Avinash Varadarajan
    The Lancet Digital Health (2023)
Abstract: Background Photographs of the external eye were recently shown to reveal signs of diabetic retinal disease and elevated glycated haemoglobin. This study aimed to test the hypothesis that external eye photographs contain information about additional systemic medical conditions. Methods We developed a deep learning system (DLS) that takes external eye photographs as input and predicts systemic parameters, such as those related to the liver (albumin, aspartate aminotransferase [AST]); kidney (estimated glomerular filtration rate [eGFR], urine albumin-to-creatinine ratio [ACR]); bone or mineral (calcium); thyroid (thyroid stimulating hormone); and blood (haemoglobin, white blood cells [WBC], platelets). This DLS was trained using 123 130 images from 38 398 patients with diabetes undergoing diabetic eye screening in 11 sites across Los Angeles county, CA, USA. Evaluation focused on nine prespecified systemic parameters and leveraged three validation sets (A, B, C) spanning 25 510 patients with and without diabetes undergoing eye screening in three independent sites in Los Angeles county, CA, and the greater Atlanta area, GA, USA. We compared performance against baseline models incorporating available clinicodemographic variables (eg, age, sex, race and ethnicity, years with diabetes). Findings Relative to the baseline, the DLS achieved statistically significant superior performance at detecting AST >36.0 U/L, calcium <8.6 mg/dL, eGFR <60.0 mL/min/1.73 m2, haemoglobin <11.0 g/dL, platelets <150.0 × 103/μL, ACR ≥300 mg/g, and WBC <4.0 × 103/μL on validation set A (a population resembling the development datasets), with the area under the receiver operating characteristic curve (AUC) of the DLS exceeding that of the baseline by 5.3–19.9% (absolute differences in AUC). On validation sets B and C, with substantial patient population differences compared with the development datasets, the DLS outperformed the baseline for ACR ≥300.0 mg/g and haemoglobin <11.0 g/dL by 7.3–13.2%. Interpretation We found further evidence that external eye photographs contain biomarkers spanning multiple organ systems. Such biomarkers could enable accessible and non-invasive screening of disease. Further work is needed to understand the translational implications.
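The headline comparison is the absolute difference in AUC between the DLS and a clinicodemographic baseline. A bootstrap sketch of that comparison on synthetic scores (illustrative only):

```python
# Bootstrap CI for the absolute AUC difference between two models.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 2000
y = rng.integers(0, 2, n)                    # e.g., haemoglobin < 11 g/dL
dls_score = y * 0.8 + rng.normal(0, 0.6, n)  # stronger (synthetic) signal
base_score = y * 0.3 + rng.normal(0, 0.6, n) # weaker baseline signal

deltas = []
for _ in range(1000):
    idx = rng.integers(0, n, n)              # resample cases with replacement
    if y[idx].min() == y[idx].max():
        continue                             # skip degenerate resamples
    deltas.append(roc_auc_score(y[idx], dls_score[idx])
                  - roc_auc_score(y[idx], base_score[idx]))
lo, hi = np.percentile(deltas, [2.5, 97.5])
print(f"AUC gain: {np.mean(deltas):.3f} (95% CI {lo:.3f} to {hi:.3f})")
```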
    Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging
    Laura Anne Culp
    Jan Freyberg
    Basil Mustafa
    Sebastien Baur
    Simon Kornblith
    Ting Chen
    Patricia MacWilliams
    Sara Mahdavi
    Megan Zoë Walker
    Aaron Loh
    Cameron Chen
    Scott Mayer McKinney
    Zach William Beaver
    Fiona Keleher Ryan
    Mozziyar Etemadi
    Umesh Telang
    Lily Hao Yi Peng
    Geoffrey Everest Hinton
    Mohammad Norouzi
    Nature Biomedical Engineering (2023)
Abstract: Machine-learning models for medical tasks can match or surpass the performance of clinical experts. However, in settings differing from those of the training dataset, the performance of a model can deteriorate substantially. Here we report a representation-learning strategy for machine-learning models applied to medical-imaging tasks that mitigates this ‘out-of-distribution’ performance problem and that improves model robustness and training efficiency. The strategy, which we named REMEDIS (for ‘Robust and Efficient Medical Imaging with Self-supervision’), combines large-scale supervised transfer learning on natural images and intermediate contrastive self-supervised learning on medical images and requires minimal task-specific customization. We show the utility of REMEDIS in a range of diagnostic-imaging tasks covering six imaging domains and 15 test datasets, and by simulating three realistic out-of-distribution scenarios. REMEDIS improved in-distribution diagnostic accuracies up to 11.5% with respect to strong supervised baseline models, and in out-of-distribution settings required only 1–33% of the data for retraining to match the performance of supervised models retrained using all available data. REMEDIS may accelerate the development lifecycle of machine-learning models for medical imaging.
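REMEDIS combines supervised transfer from natural images with intermediate contrastive self-supervision on medical images. A skeletal PyTorch sketch of that staged structure, assuming torchvision >= 0.13 (the published work used different backbones and training details):

```python
# Skeleton of a REMEDIS-style training recipe; illustrative only.
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

# Stage 1: supervised transfer learning -- start from ImageNet weights.
backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()                  # expose 2048-d features

# Stage 2: contrastive SSL on unlabeled medical images. A projection head
# is trained jointly with the backbone (see the NT-Xent sketch above):
projector = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 128))
# for x1, x2 in unlabeled_medical_loader:        # two augmented views
#     loss = nt_xent_loss(projector(backbone(x1)), projector(backbone(x2)))

# Stage 3: fine-tune on the downstream task with a small labeled set.
classifier = nn.Linear(2048, 2)              # e.g., referable vs. not
model = nn.Sequential(backbone, classifier)
```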
Abstract: The application of an artificial intelligence (AI)-based screening tool for retinal disease in India and Thailand highlighted the myths and reality of introducing medical AI, which may form a framework for subsequent tools.
    Large Language Models Encode Clinical Knowledge
    Sara Mahdavi
    Jason Wei
    Hyung Won Chung
    Nathan Scales
    Ajay Tanwani
    Heather Cole-Lewis
    Perry Payne
    Martin Seneviratne
    Paul Gamble
    Abubakr Abdelrazig Hassan Babiker
    Nathanael Schaerli
    Philip Mansfield
    Dina Demner-Fushman
    Katherine Chou
    Juraj Gottweis
    Nenad Tomašev
    Alvin Rajkomar
    Joelle Barral
    Nature (2023)
Abstract: Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate the Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
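Scoring the multiple-choice portion of a benchmark such as MultiMedQA reduces to comparing parsed model choices against an answer key. A trivial sketch (the model outputs here are hypothetical strings, not real Med-PaLM output):

```python
# Multiple-choice accuracy: fraction of questions where the model's parsed
# option letter matches the reference key.
def mc_accuracy(predictions, answer_key):
    assert len(predictions) == len(answer_key)
    hits = sum(p.strip().upper() == a.strip().upper()
               for p, a in zip(predictions, answer_key))
    return hits / len(answer_key)

preds = ["B", "c", "A", "D"]        # parsed model choices (hypothetical)
key = ["B", "C", "A", "B"]          # reference answers
print(mc_accuracy(preds, key))      # 0.75
```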
    Risk Stratification for Diabetic Retinopathy Screening Order Using Deep Learning: A Multicenter Prospective Study
    Ashish Bora
    Sunny Virmani
    Rayman Huang
    Ilana Traynis
    Lily Peng
    Avinash Varadarajan
    Warisara Pattanapongpaiboon
    Reena Chopra
    Dr. Paisan Raumviboonsuk
    Translational Vision Science & Technology (2023)
Abstract: Purpose: Real-world evaluation of a deep learning model that prioritizes patients based on risk of progression to moderate or worse (MOD+) diabetic retinopathy (DR). Methods: This nonrandomized, single-arm, prospective, interventional study included patients attending DR screening at four centers across Thailand from September 2019 to January 2020, with mild or no DR. Fundus photographs were input into the model, and patients were scheduled for their subsequent screening from September 2020 to January 2021 in order of predicted risk. Evaluation focused on model sensitivity, defined as correctly ranking patients that developed MOD+ within the first 50% of subsequent screens. Results: We analyzed 1,757 patients, of whom 52 (3.0%) developed MOD+. Using the model-proposed order, the model's sensitivity was 90.4%. Both the model-proposed order and mild/no DR plus HbA1c had significantly higher sensitivity than the random order (P < 0.001). Excluding one major (rural) site that had practical implementation challenges, the remaining sites included 567 patients, of whom 15 (2.6%) developed MOD+. Here, the model-proposed order achieved 86.7% sensitivity versus 73.3% for the ranking that used DR grade and hemoglobin A1c. Conclusions: The model can help prioritize follow-up visits for the largest subgroups of DR patients (those with no or mild DR). Further research is needed to evaluate the impact on clinical management and outcomes. Translational Relevance: Deep learning demonstrated potential for risk stratification in DR screening. However, real-world practicalities must be resolved to fully realize the benefit.
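The paper's sensitivity metric, the fraction of MOD+ progressors ranked into the first half of the proposed screening order, can be computed directly from risk scores. A sketch on synthetic data:

```python
# Fraction of true progressors captured in the top 50% of the risk ranking.
import numpy as np

def top_half_sensitivity(risk_scores, progressed):
    order = np.argsort(-np.asarray(risk_scores))     # highest risk first
    top = set(order[: len(order) // 2].tolist())
    hits = sum(i in top for i, p in enumerate(progressed) if p)
    return hits / sum(progressed)

rng = np.random.default_rng(3)
progressed = rng.random(1757) < 0.03                 # ~3% develop MOD+
risk = progressed * 1.5 + rng.normal(0, 1, 1757)     # informative score
print(f"{top_half_sensitivity(risk, progressed):.1%}")
```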
    Detection of signs of disease in external photographs of the eyes via deep learning
    Akinori Mitani
    Ilana Traynis
    Naho Kitade
    April Maa
    Jorge Cuadros
    Lily Hao Yi Peng
    Avinash Vaidyanathan Varadarajan
    Nature Biomedical Engineering (2022)
Abstract: Retinal fundus photographs can be used to detect a range of retinal conditions. Here we show that deep-learning models trained instead on external photographs of the eyes can be used to detect diabetic retinopathy (DR), diabetic macular oedema and poor blood glucose control. We developed the models using eye photographs from 145,832 patients with diabetes from 301 DR screening sites and evaluated the models on four tasks and four validation datasets with a total of 48,644 patients from 198 additional screening sites. For all four tasks, the predictive performance of the deep-learning models was significantly higher than the performance of logistic regression models using self-reported demographic and medical history data, and the predictions generalized to patients with dilated pupils, to patients from a different DR screening programme and to a general eye care programme that included diabetics and non-diabetics. We also explored the use of the deep-learning models for the detection of elevated lipid levels. The utility of external eye photographs for the diagnosis and management of diseases should be further validated with images from different cameras and patient populations.
Abstract: Recently it was shown that blood hemoglobin concentration could be predicted from retinal fundus photographs by deep learning models. However, it is unclear whether the models were quantifying current blood hemoglobin level, or estimating based on subjects' pretest probability of having anemia. Here, we conducted an observational study with 14 volunteers who donated blood at an on-site blood drive held by the local blood center (i.e., at which time approximately 10% of their blood was removed). When the deep learning model was applied to retinal fundus photographs taken before and after blood donation, it detected a decrease in blood hemoglobin concentration within each subject at 2-3 days after donation, suggesting that the model was quantifying subacute hemoglobin changes instead of predicting subjects' risk. Additional randomized or controlled studies can further validate this finding.
    Performance of a Diabetic Retinopathy Artificial Intelligence Algorithm for Ultra-widefield Imaging
    Tunde Peto
    Lloyd Aiello
    Srinivas R Sadda
    Drew Lewis
    Anne Marie Cairns
    Dana Keane
    Sunny Virmani
    Jerry Cavallerano
    Barba Hamill
    Lily Peng
    Sara Ellen Godek
    Lu Yang
    Naho Kitade
    Kira Whitehouse
    ARVO (2022)
Abstract: Purpose: To evaluate the performance of a deep learning model for diabetic retinopathy (DR) and diabetic macular edema screening when using ultra-widefield (UWF) imaging. Methods: For model development, 67,200 UWF images were collected from DR programs and ophthalmology clinics worldwide. 30,836 images were double graded and adjudicated at 8 grading centres by 125 certified graders using the ETDRS extension of the Modified Airlie House Classification of Diabetic Retinopathy, following the JVN Clinical Trial Ultrawide Field Grading Manual v1.0. The grading system used the traditional ETDRS 7-SF field definition as well as extended fields 3-7 to evaluate the retinal periphery. A further 36,364 UWF images were graded using a grading protocol based on the ICDR classification. The dataset was split into training, tuning, and testing. The final DR model is an ensemble of 10 EfficientNet-b0 neural networks, independently trained with standard image augmentation techniques. For model validation, two independent sets of images were collected. Model performance was evaluated by comparing its predictions to the adjudicated ground truth for both sets of images. Results: Prior to clinical validation, the model performance was internally evaluated on an independent set of 1967 images, of which 1050 were graded via adjudication as negative for more than mild diabetic retinopathy (mtmDR negative), and 917 as having referable diabetic retinopathy (mtmDR positive). The overall performance (Table 1) was weighted by target DR distribution. Clinical validation evaluated an independent data set of 420 images selected to achieve a target distribution that enabled appropriate confidence intervals for mtmDR sensitivity and specificity. A panel of three graders adjudicated these 420 images and assessed 241 as mtmDR negative, 179 as mtmDR positive, and 135 as vtDR positive. The model's performance on the clinical validation set is shown in Table 2. Conclusions: The deep learning model was developed with high-quality graded UWF images and performed at a level that highly suggests usefulness in a clinical screening setting. A large, prospective multi-center clinical trial is currently evaluating the performance of a similar model in a real-world clinical setting. This abstract was presented at the 2022 ARVO Annual Meeting, held in Denver, CO, May 1-4, 2022, and virtually.
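The model described is an ensemble of 10 independently trained EfficientNet-b0 networks. A sketch of ensemble averaging at inference time, using the timm package with untrained weights (purely illustrative):

```python
# Averaging softmax outputs across an ensemble of EfficientNet-b0 models.
import torch
import timm

models = [timm.create_model("efficientnet_b0", num_classes=2) for _ in range(10)]
for m in models:
    m.eval()

@torch.no_grad()
def ensemble_predict(x):
    probs = torch.stack([m(x).softmax(dim=1) for m in models])  # (10, N, 2)
    return probs.mean(dim=0)                                    # (N, 2)

x = torch.randn(4, 3, 224, 224)   # stand-in batch of image crops
print(ensemble_predict(x).shape)
```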
    Real-time diabetic retinopathy screening by deep learning in a multisite national screening programme: a prospective interventional cohort study
    Dr. Paisan Raumviboonsuk
    Variya Nganthavee
    Kornwipa Hemarat
    Apinpat Kongprayoon
    Rajiv Raman
    Brian Levinstein
    Roy Lee
    Sunny Virmani
    John Chambers
    Fred Hersch
    Lily Hao Yi Peng
    The Lancet Digital Health (2022)
Abstract: Background: Diabetic retinopathy is a leading cause of preventable blindness, especially in low-income and middle-income countries (LMICs). Deep-learning systems have the potential to enhance diabetic retinopathy screenings in these settings, yet prospective studies assessing their usability and performance are scarce. Methods: We did a prospective interventional cohort study to evaluate the real-world performance and feasibility of deploying a deep-learning system into the health-care system of Thailand. Patients with diabetes and listed on the national diabetes registry, aged 18 years or older, able to have their fundus photograph taken for at least one eye, and due for screening as per the Thai Ministry of Public Health guidelines were eligible for inclusion. Eligible patients were screened with the deep-learning system at nine primary care sites under Thailand's national diabetic retinopathy screening programme. Patients with a previous diagnosis of diabetic macular oedema, severe non-proliferative diabetic retinopathy, or proliferative diabetic retinopathy; previous laser treatment of the retina or retinal surgery; other non-diabetic retinopathy eye disease requiring referral to an ophthalmologist; or inability to have fundus photograph taken of both eyes for any reason were excluded. Deep-learning system-based interpretations of patient fundus images and referral recommendations were provided in real time. As a safety mechanism, regional retina specialists over-read each image. Performance of the deep-learning system (accuracy, sensitivity, specificity, positive predictive value [PPV], and negative predictive value [NPV]) were measured against an adjudicated reference standard, provided by fellowship-trained retina specialists. This study is registered with the Thai national clinical trials registry, TCRT20190902002. Findings: Between Dec 12, 2018, and March 29, 2020, 7940 patients were screened for inclusion. 7651 (96·3%) patients were eligible for study analysis, and 2412 (31·5%) patients were referred for diabetic retinopathy, diabetic macular oedema, ungradable images, or low visual acuity. For vision-threatening diabetic retinopathy, the deep-learning system had an accuracy of 94·7% (95% CI 93·0–96·2), sensitivity of 91·4% (87·1–95·0), and specificity of 95·4% (94·1–96·7). The retina specialist over-readers had an accuracy of 93·5% (91·7–95·0; p=0·17), a sensitivity of 84·8% (79·4–90·0; p=0·024), and specificity of 95·5% (94·1–96·7; p=0·98). The PPV for the deep-learning system was 79·2% (95% CI 73·8–84·3) compared with 75·6% (69·8–81·1) for the over-readers. The NPV for the deep-learning system was 95·5% (92·8–97·9) compared with 92·4% (89·3–95·5) for the over-readers. Interpretation: A deep-learning system can deliver real-time diabetic retinopathy detection capability similar to retina specialists in community-based screening settings. Socioenvironmental factors and workflows must be taken into consideration when implementing a deep-learning system within a large-scale screening programme in LMICs. Funding: Google and Rajavithi Hospital, Bangkok, Thailand.
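The reported accuracy, sensitivity, specificity, PPV, and NPV all derive from a 2x2 table against the adjudicated reference standard. A sketch with made-up counts and normal-approximation confidence intervals:

```python
# Screening metrics from a confusion table, with approximate 95% CIs.
import math

def prop_ci(k, n):
    p = k / n
    se = math.sqrt(p * (1 - p) / n)
    return p, p - 1.96 * se, p + 1.96 * se

tp, fp, fn, tn = 170, 45, 16, 930          # hypothetical counts
for name, k, n in [
    ("sensitivity", tp, tp + fn),
    ("specificity", tn, tn + fp),
    ("PPV", tp, tp + fp),
    ("NPV", tn, tn + fn),
    ("accuracy", tp + tn, tp + fp + fn + tn),
]:
    p, lo, hi = prop_ci(k, n)
    print(f"{name}: {100*p:.1f}% (95% CI {100*lo:.1f}-{100*hi:.1f})")
```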
    Deep learning to detect optical coherence tomography-derived diabetic macular edema from retinal photographs: a multicenter validation study
    Xinle Sheila Liu
    Tayyeba Ali
    Ami Shah
    Scott Mayer McKinney
    Paisan Ruamviboonsuk
    Angus W. Turner
    Pearse A. Keane
    Peranut Chotcomwongse
    Variya Nganthavee
    Mark Chia
    Josef Huemer
    Jorge Cuadros
    Rajiv Raman
    Lily Hao Yi Peng
    Avinash Vaidyanathan Varadarajan
    Reena Chopra
    Ophthalmology Retina (2022)
Abstract: Purpose To validate the generalizability of a deep learning system (DLS) that detects diabetic macular edema (DME) from two-dimensional color fundus photography (CFP), where the reference standard for retinal thickness and fluid presence is derived from three-dimensional optical coherence tomography (OCT). Design Retrospective validation of a DLS across international datasets. Participants Paired CFP and OCT of patients from diabetic retinopathy (DR) screening programs or retina clinics. The DLS was developed using datasets from Thailand, the United Kingdom (UK) and the United States and validated using 3,060 unique eyes from 1,582 patients across screening populations in Australia, India and Thailand. The DLS was separately validated in 698 eyes from 537 screened patients in the UK with mild DR and suspicion of DME based on CFP. Methods The DLS was trained using DME labels from OCT. Presence of DME was based on retinal thickening or intraretinal fluid. The DLS’s performance was compared to expert grades of maculopathy and to a previous proof-of-concept version of the DLS. We further simulated integration of the current DLS into an algorithm trained to detect DR from CFPs. Main Outcome Measures Superiority of specificity and non-inferiority of sensitivity of the DLS for the detection of center-involving DME, using device-specific thresholds, compared to experts. Results Primary analysis in a combined dataset spanning Australia, India, and Thailand showed the DLS had 80% specificity and 81% sensitivity compared to expert graders who had 59% specificity and 70% sensitivity. Relative to human experts, the DLS had significantly higher specificity (p=0.008) and non-inferior sensitivity. In the UK cohort, the DLS's specificity significantly exceeded the prespecified 50% threshold and its sensitivity was 100% (p=0.02 for sensitivity > 90%). Conclusions The DLS can generalize to multiple international populations with an accuracy exceeding experts. The clinical value of this DLS to reduce false positive referrals, thus decreasing the burden on specialist eye care, warrants prospective evaluation.
Abstract: AI models have shown promise in performing many medical imaging tasks. However, our ability to explain what signals these models learn from the training data is severely lacking. Explanations are needed in order to increase the trust of doctors in AI-based models, especially in domains where AI prediction capabilities surpass those of humans. Moreover, such explanations could enable novel scientific discovery by uncovering signals in the data that aren’t yet known to experts. In this paper, we present a method for automatic visual explanations that can help achieve these goals by generating hypotheses of what visual signals in the images are correlated with the task. We propose the following 4 steps: (i) Train a classifier to perform a given task to assess whether the imagery indeed contains signals relevant to the task; (ii) Train a StyleGAN-based image generator with an architecture that enables guidance by the classifier (“StylEx”); (iii) Automatically detect and extract the top visual attributes that the classifier is sensitive to. Each of these attributes can then be independently modified for a set of images to generate counterfactual visualizations of those attributes (i.e. what that image would look like with the attribute increased or decreased); (iv) Present the discovered attributes and corresponding counterfactual visualizations to a multidisciplinary panel of experts to formulate hypotheses for the underlying mechanisms with consideration to social and structural determinants of health (e.g. whether the attributes correspond to known patho-physiological or socio-cultural phenomena, or could be novel discoveries) and stimulate future research. To demonstrate the broad applicability of our approach, we present results on eight prediction tasks across three medical imaging modalities – retinal fundus photographs, external eye photographs, and chest radiographs. We showcase examples where many of the automatically-learned attributes clearly capture clinically known features (e.g., types of cataract, enlarged heart), and demonstrate automatically-learned confounders that arise from factors beyond physiological mechanisms (e.g., chest X-ray underexposure is correlated with the classifier predicting abnormality, and eye makeup is correlated with the classifier predicting low hemoglobin levels). We further show that our method reveals a number of physiologically plausible novel attributes for future investigation (e.g., differences in the fundus associated with self-reported sex, which were previously unknown). While our approach is not able to discern causal pathways, the ability to generate hypotheses from the attribute visualizations has the potential to enable researchers to better understand, improve their assessment, and extract new knowledge from AI-based models. Importantly, we highlight that attributes generated by our framework can capture phenomena beyond physiology or pathophysiology, reflecting the real world nature of healthcare delivery and socio-cultural factors, and hence multidisciplinary perspectives are critical in these investigations. Finally, we release code to enable researchers to train their own StylEx models and analyze their predictive tasks of interest, and use the methodology presented in this paper for responsible interpretation of the revealed attributes.
Abstract: Background: Many dermatologic cases are first evaluated by primary care physicians or nurse practitioners. Objective: This study aimed to evaluate an artificial intelligence (AI)-based tool that assists with interpreting dermatologic conditions. Methods: We developed an AI-based tool and conducted a randomized multi-reader, multi-case study (20 primary care physicians, 20 nurse practitioners, and 1047 retrospective teledermatology cases) to evaluate its utility. Cases were enriched and comprised 120 skin conditions. Readers were recruited to optimize for geographical diversity; the primary care physicians practiced across 12 states (2-32 years of experience, mean 11.3 years), and the nurse practitioners practiced across 9 states (2-34 years of experience, mean 13.1 years). To avoid memory effects from incomplete washout, each case was read once by each clinician either with or without AI assistance, with the assignment randomized. The primary analyses evaluated the top-1 agreement, defined as the agreement rate of the clinicians’ primary diagnosis with the reference diagnoses provided by a panel of dermatologists (per case: 3 dermatologists from a pool of 12, practicing across 8 states, with 5-13 years of experience, mean 7.2 years of experience). We additionally conducted subgroup analyses stratified by cases’ self-reported race and ethnicity and measured the performance spread: the maximum performance subtracted by the minimum across subgroups. Results: The AI’s standalone top-1 agreement was 63%, and AI assistance was significantly associated with higher agreement with reference diagnoses. For primary care physicians, the increase in diagnostic agreement was 10% (P<.001), from 48% to 58%; for nurse practitioners, the increase was 12% (P<.001), from 46% to 58%. When stratified by cases’ self-reported race or ethnicity, the AI’s performance was 59%-62% for Asian, Native Hawaiian, Pacific Islander, other, and Hispanic or Latinx individuals and 67% for both Black or African American and White subgroups. For the clinicians, AI assistance–associated improvements across subgroups were in the range of 8%-12% for primary care physicians and 8%-15% for nurse practitioners. The performance spread across subgroups was 5.3% unassisted vs 6.6% assisted for primary care physicians and 5.2% unassisted vs 6.0% assisted for nurse practitioners. In both unassisted and AI-assisted modalities, and for both primary care physicians and nurse practitioners, the subgroup with the highest performance on average was Black or African American individuals, though the differences with other subgroups were small and had overlapping 95% CIs. Conclusions: AI assistance was associated with significantly improved diagnostic agreement with dermatologists. Across race and ethnicity subgroups, for both primary care physicians and nurse practitioners, the effect of AI assistance remained high at 8%-15%, and the performance spread was similar at 5%-7%.
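Both the primary endpoint (top-1 agreement with the dermatologist reference) and the subgroup "performance spread" (maximum minus minimum agreement across subgroups) are straightforward aggregations. A pandas sketch on toy data:

```python
# Top-1 agreement overall and its spread across subgroups. Toy data.
import pandas as pd

df = pd.DataFrame({
    "primary_dx": ["eczema", "psoriasis", "acne", "eczema", "tinea", "acne"],
    "reference_dx": ["eczema", "eczema", "acne", "eczema", "tinea", "rosacea"],
    "subgroup": ["A", "A", "B", "B", "C", "C"],
})
df["agree"] = df["primary_dx"] == df["reference_dx"]

overall = df["agree"].mean()
by_group = df.groupby("subgroup")["agree"].mean()
spread = by_group.max() - by_group.min()
print(f"top-1 agreement {overall:.0%}; subgroup spread {spread:.0%}")
```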
    Machine learning for clinical operations improvement via case triaging
    Susan Jen Huang
    Kimberly Kanada
    Lily Hao Yi Peng
    Peggy Bui
    Skin Health and Disease (2021)
Abstract: In recent years, an increasing number of machine learning (ML) models have been developed for interpreting images of skin conditions and for risk stratification. Beyond accurate image interpretation, one potential application of these interpretations may be triaging systems to help direct care to the right care provider at the right time. This is a critical need because dermatologist appointment wait times exceed a month in many regions, a trend that can potentially be alleviated by rapidly stratifying patients to clinicians with the appropriate level of training (e.g., board-certified dermatologist, advanced practice provider under dermatologist supervision, non-dermatologist) and the appropriate urgency. To help understand ML's potential for this triaging, we analysed a previously-described deep learning system (DLS) that provides a differential diagnosis of teledermatology cases and that improved the diagnostic accuracy of primary care physicians and nurse practitioners in a randomized study. We reordered the cases within each ‘review batch’ of 500 based on the urgency category of the DLS-predicted skin condition (which is an automated process requiring no human intervention). On average, this caused the review order of urgent cases to be prioritised substantially sooner than that of less urgent cases, with the average rank of ‘immediate intervention cases’ being about 100 (vs. 253 without reordering, p < 0.001), and that of ‘no need to see a doctor’ cases being close to 400 (vs. 252 without reordering, p < 0.001). Our approach has the potential to accelerate triaging and reduce the burden on the limited dermatology workforce to focus on patient management.
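The reordering itself is a plain sort of each review batch by the urgency category of the DLS-predicted condition. A toy sketch showing how the average rank of the most urgent cases drops after reordering (urgency labels here are hypothetical):

```python
# Reorder a review batch of 500 by predicted urgency and compare the mean
# review rank of the most urgent cases before and after.
import random

URGENCY = {"immediate": 0, "within_weeks": 1, "routine": 2, "no_doctor": 3}

random.seed(4)
batch = [{"case_id": i, "urgency": random.choice(list(URGENCY))}
         for i in range(500)]

def mean_rank(cases, level):
    ranks = [r for r, c in enumerate(cases, start=1) if c["urgency"] == level]
    return sum(ranks) / len(ranks)

print("before:", round(mean_rank(batch, "immediate")))
reordered = sorted(batch, key=lambda c: URGENCY[c["urgency"]])
print("after: ", round(mean_rank(reordered, "immediate")))
```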
Abstract: Importance: Most dermatologic cases are initially evaluated by nondermatologists such as primary care physicians (PCPs) or nurse practitioners (NPs). Objective: To evaluate an artificial intelligence (AI)–based tool that assists with diagnoses of dermatologic conditions. Design, Setting, and Participants: This multiple-reader, multiple-case diagnostic study developed an AI-based tool and evaluated its utility. Primary care physicians and NPs retrospectively reviewed an enriched set of cases representing 120 different skin conditions. Randomization was used to ensure each clinician reviewed each case either with or without AI assistance; each clinician alternated between batches of 50 cases in each modality. The reviews occurred from February 21 to April 28, 2020. Data were analyzed from May 26, 2020, to January 27, 2021. Exposures: An AI-based assistive tool for interpreting clinical images and associated medical history. Main Outcomes and Measures: The primary analysis evaluated agreement with reference diagnoses provided by a panel of 3 dermatologists for PCPs and NPs. Secondary analyses included diagnostic accuracy for biopsy-confirmed cases, biopsy and referral rates, review time, and diagnostic confidence. Results: Forty board-certified clinicians, including 20 PCPs (14 women [70.0%]; mean experience, 11.3 [range, 2-32] years) and 20 NPs (18 women [90.0%]; mean experience, 13.1 [range, 2-34] years) reviewed 1048 retrospective cases (672 female [64.2%]; median age, 43 [interquartile range, 30-56] years; 41 920 total reviews) from a teledermatology practice serving 11 sites and provided 0 to 5 differential diagnoses per case (mean [SD], 1.6 [0.7]). The PCPs were located across 12 states, and the NPs practiced in primary care without physician supervision across 9 states. Artificial intelligence assistance was significantly associated with higher agreement with reference diagnoses. For PCPs, the increase in diagnostic agreement was 10% (95% CI, 8%-11%; P < .001), from 48% to 58%; for NPs, the increase was 12% (95% CI, 10%-14%; P < .001), from 46% to 58%. In secondary analyses, agreement with biopsy-obtained diagnosis categories of malignant, precancerous, or benign increased by 3% (95% CI, −1% to 7%) for PCPs and by 8% (95% CI, 3%-13%) for NPs. Rates of desire for biopsies decreased by 1% (95% CI, 0-3%) for PCPs and 2% (95% CI, 1%-3%) for NPs; the rate of desire for referrals decreased by 3% (95% CI, 1%-4%) for PCPs and NPs. Diagnostic agreement on cases not indicated for a dermatologist referral increased by 10% (95% CI, 8%-12%) for PCPs and 12% (95% CI, 10%-14%) for NPs, and median review time increased slightly by 5 (95% CI, 0-8) seconds for PCPs and 7 (95% CI, 5-10) seconds for NPs per case. Conclusions and Relevance: Artificial intelligence assistance was associated with improved diagnoses by PCPs and NPs for 1 in every 8 to 10 cases, indicating potential for improving the quality of dermatologic care.
    AI Papers in Ophthalmology Made Simple
    Sohee Jeon
    Ji-Peng Olivia Li
    Lily Peng
    Daniel Ting
    Nature Eye (2020)
Abstract: Recently, EYE has published a few manuscripts on artificial intelligence (AI) systems based on deep learning (DL). In ophthalmology, with the exponential growth in computational power, ocular imaging quality, and increasing capabilities, several groups have applied AI productively to interpret ocular images for diagnosis, referral management, risk stratification, and prognostication. Clinical implementation has also begun with the first FDA-cleared AI-equipped fundus camera for DR screening in 2018 (IDx-DR; IDx Technologies Inc, Coralville, IA, USA). Many general ophthalmologists may not have a computer science background, and traditional critical analysis skills for clinical studies do not always directly apply to AI studies. This editorial outlines a stepwise approach to help readers critically read the introduction, methods, results, and discussion components of an AI paper, with a view towards how these technologies can potentially be applied in routine clinical practice.
    Longitudinal Screening for Diabetic Retinopathy in a Nationwide Screening Program: Comparing Deep Learning and Human Graders
    Jirawut Limwattanayingyong
    Variya Nganthavee
    Kasem Seresirikachorn
    Tassapol Singalavanija
    Ngamphol Soonthornworasiri
    Varis Ruamviboonsuk
    Chetan Rao
    Rajiv Raman
    Andrzej Grzybowski
    Lily Hao Yi Peng
    Fred Hersch
    Richa Tiwari, PhD
    Dr. Paisan Raumviboonsuk
    Journal of Diabetes Research (2020)
Abstract: Objective. To evaluate diabetic retinopathy (DR) screening via deep learning (DL) and trained human graders (HG) in a longitudinal cohort, as case spectrum shifts based on treatment referral and new-onset DR. Methods. We randomly selected patients with diabetes screened twice, two years apart within a nationwide screening program. The reference standard was established via adjudication by retina specialists. Each patient’s color fundus photographs were graded, and a patient was considered as having sight-threatening DR (STDR) if the worse eye had severe nonproliferative DR, proliferative DR, or diabetic macular edema. We compared DR screening via two modalities: DL and HG. For each modality, we simulated treatment referral by excluding patients with detected STDR from the second screening using that modality. Results. There were 5,738 patients (12.3% STDR) in the first screening. DL and HG captured different numbers of STDR cases, and after simulated referral and excluding ungradable cases, 4,148 and 4,263 patients remained in the second screening, respectively. The STDR prevalence at the second screening was 5.1% and 6.8% for DL- and HG-based screening, respectively. Along with the prevalence decrease, the sensitivity for both modalities decreased from the first to the second screening (DL: from 95% to 90%, p=0.008; HG: from 74% to 57%, p<0.001). At both the first and second screenings, the rate of false negatives for the DL was a fifth that of HG (0.5-0.6% vs. 2.9-3.2%). Conclusion. On 2-year longitudinal follow-up of a DR screening cohort, STDR prevalence decreased for both DL- and HG-based screening. Follow-up screenings in longitudinal DR screening can be more difficult and induce lower sensitivity for both DL and HG, though the false negative rate was substantially lower for DL. Our data may be useful for health-economics analyses of longitudinal screening settings.
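The "simulated treatment referral" step excludes patients whose STDR was detected by a given modality at the first screening before recomputing second-screening prevalence. A sketch on synthetic data (not the study's code):

```python
# Sketch of the simulated-referral analysis: patients flagged as STDR by a
# modality at screening 1 are excluded from that modality's screening 2
# cohort before recomputing prevalence. Synthetic data only.
import numpy as np

rng = np.random.default_rng(5)
n = 5738
stdr_round1 = rng.random(n) < 0.123              # 12.3% STDR at screening 1
detected = stdr_round1 & (rng.random(n) < 0.95)  # modality sensitivity ~95%

remaining = ~detected                            # simulated treatment referral
stdr_round2 = rng.random(n) < 0.05               # STDR status two years later
prevalence2 = stdr_round2[remaining].mean()
print(f"screening-2 cohort: {remaining.sum()} patients, "
      f"STDR prevalence {prevalence2:.1%}")
```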
    Predicting the risk of developing diabetic retinopathy using deep learning
    Ashish Bora
    Siva Balasubramanian
    Sunny Virmani
    Akinori Mitani
    Guilherme De Oliveira Marinho
    Jorge Cuadros
    Dr. Paisan Raumviboonsuk
    Lily Hao Yi Peng
    Avinash Vaidyanathan Varadarajan
    Lancet Digital Health (2020)
Abstract: Background: Diabetic retinopathy screening is instrumental to preventing blindness, but scaling up screening is challenging because of the increasing number of patients with all forms of diabetes. We aimed to create a deep-learning system to predict the risk of patients with diabetes developing diabetic retinopathy within 2 years. Methods: We created and validated two versions of a deep-learning system to predict the development of diabetic retinopathy in patients with diabetes who had had teleretinal diabetic retinopathy screening in a primary care setting. The input for the two versions was either a set of three-field or one-field colour fundus photographs. Of the 575 431 eyes in the development set, 28 899 had known outcomes, with the remaining 546 532 eyes used to augment the training process via multitask learning. Validation was done on one eye (selected at random) per patient from two datasets: an internal validation (from EyePACS, a teleretinal screening service in the USA) set of 3678 eyes with known outcomes and an external validation (from Thailand) set of 2345 eyes with known outcomes. Findings: The three-field deep-learning system had an area under the receiver operating characteristic curve (AUC) of 0·79 (95% CI 0·77–0·81) in the internal validation set. Assessment of the external validation set—which contained only one-field colour fundus photographs—with the one-field deep-learning system gave an AUC of 0·70 (0·67–0·74). In the internal validation set, the AUC of available risk factors was 0·72 (0·68–0·76), which improved to 0·81 (0·77–0·84) after combining the deep-learning system with these risk factors (p<0·0001). In the external validation set, the corresponding AUC improved from 0·62 (0·58–0·66) to 0·71 (0·68–0·75; p<0·0001) following the addition of the deep-learning system to available risk factors. Interpretation: The deep-learning systems predicted diabetic retinopathy development using colour fundus photographs, and the systems were independent of and more informative than available risk factors. Such a risk stratification tool might help to optimise screening intervals to reduce costs while improving vision-related outcomes.
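Combining the deep-learning system with available risk factors, as in the AUC analysis above, can be done by feeding both into a single logistic model. An illustrative sketch on synthetic features (in-sample AUCs, for brevity):

```python
# Compare risk factors alone vs. risk factors plus the DLS score.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(6)
n = 3000
y = (rng.random(n) < 0.1).astype(int)                # develops DR in 2 years
risk_factors = y[:, None] * 0.4 + rng.normal(0, 1, (n, 3))
dls_score = y * 0.9 + rng.normal(0, 1, n)            # synthetic DLS output

for name, X in [("risk factors", risk_factors),
                ("risk factors + DLS",
                 np.column_stack([risk_factors, dls_score]))]:
    p = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
    print(name, round(roc_auc_score(y, p), 3))
```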
    Scientific Discovery by Generating Counterfactuals using Image Translation
Arunachalam Narayanaswamy
    Lily Hao Yi Peng
    Dr. Paisan Raumviboonsuk
    Avinash Vaidyanathan Varadarajan
    Proceedings of MICCAI, International Conference on Medical Image Computing and Computer-Assisted Intervention (2020)
Abstract: Visual recognition models are increasingly applied to scientific domains such as drug studies and medical diagnoses, and model explanation techniques play a critical role in understanding the source of a model's performance and making its decisions transparent. In this work we investigate if explanation techniques can also be used as a mechanism for scientific discovery. We make two contributions: first, we propose a framework to convert predictions from explanation techniques to a mechanism of discovery; second, we show how generative models in combination with black-box predictors can be used to generate hypotheses (without human priors) that can be critically examined. With these techniques we study classification models on retinal fundus images predicting diabetic macular edema (DME). Essentially, deep convolutional models on 2D retinal fundus images can do nearly as well as ophthalmologists looking at 3D scans, making this an interesting case study of clinical relevance. Our work highlights that while existing explanation tools are useful, they do not necessarily provide a complete answer. With the proposed framework we are able to bridge the gap between a model's performance and human understanding of the underlying mechanism, which is of vital scientific interest.
    Predicting OCT-derived DME grades from fundus photographs using deep learning
    Arunachalam Narayanaswamy
    Avinash Vaidyanathan Varadarajan
    Dr. Paisan Raumviboonsuk
    Dr. Peranut Chotcomwongse
    Jorge Cuadros
    Lily Hao Yi Peng
    Pearse Keane
    Nature Communications (2020)
Abstract: Diabetic eye disease is one of the fastest growing causes of preventable blindness. With the advent of anti-VEGF therapies, it has become increasingly important to detect center-involved DME (ci-DME). However, ci-DME is diagnosed using optical coherence tomography (OCT), which is not generally available at screening sites. Instead, screening programs rely on the detection of hard exudates as a proxy for DME on color fundus photographs, but this often results in a fair number of false positive and false negative calls. We trained a deep learning model to use color fundus images to directly predict grades derived from OCT exams for DME. Our OCT-based model had an AUC of 0.89 (95% CI: 0.87-0.91), which corresponds to a sensitivity of 85% at a specificity of 80%. In comparison, the ophthalmology graders had sensitivities ranging from 82%-85% and specificities ranging from 44%-50%. These metrics correspond to a PPV of 61% (95% CI: 56%-66%) for the OCT-based algorithm and a range of 36-38% (95% CI ranging from 33%-42%) for ophthalmologists. In addition, we used multiple attention techniques to explain how the model is making its prediction. The ability of deep learning algorithms to make clinically relevant predictions, which generally require sophisticated 3D-imaging equipment, from simple 2D images has broad relevance to many other applications in medical imaging.
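Operating points such as "sensitivity of 85% at a specificity of 80%" are read off the ROC curve. A scikit-learn sketch on synthetic scores:

```python
# Sensitivity achieved at a fixed specificity, read from the ROC curve.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(7)
y = rng.integers(0, 2, 5000)
score = y * 1.3 + rng.normal(0, 1, 5000)

fpr, tpr, _ = roc_curve(y, score)
target_spec = 0.80
sens = tpr[fpr <= 1 - target_spec].max()   # best sensitivity with spec >= 80%
print(f"sensitivity at {target_spec:.0%} specificity: {sens:.1%}")
```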
    A deep learning system for differential diagnosis of skin diseases
    Clara Eng
    David Way
    Kang Lee
    Peggy Bui
    Kimberly Kanada
    Guilherme de Oliveira Marinho
    Jess Gallegos
    Sara Gabriele
    Vishakha Gupta
    Nalini Singh
    Lily Peng
    Dennis Ai
    Susan Huang
    Carter Dunn
    Nature Medicine (2020)
Abstract: Skin conditions affect 1.9 billion people. Because of a shortage of dermatologists, most cases are seen instead by general practitioners with lower diagnostic accuracy. We present a deep learning system (DLS) to provide a differential diagnosis of skin conditions using 16,114 de-identified cases (photographs and clinical data) from a teledermatology practice serving 17 sites. The DLS distinguishes between 26 common skin conditions, representing 80% of cases seen in primary care, while also providing a secondary prediction covering 419 skin conditions. On 963 validation cases, where a rotating panel of three board-certified dermatologists defined the reference standard, the DLS was non-inferior to six other dermatologists and superior to six primary care physicians (PCPs) and six nurse practitioners (NPs) (top-1 accuracy: 0.66 DLS, 0.63 dermatologists, 0.44 PCPs and 0.40 NPs). These results highlight the potential of the DLS to assist general practitioners in diagnosing skin conditions.
Abstract: Background: Patients with neovascular age-related macular degeneration (AMD) can avoid vision loss via certain therapies. However, methods to predict the progression to neovascular age-related macular degeneration (nvAMD) are lacking. Purpose: To develop and validate a deep learning (DL) algorithm to predict 1-year progression of eyes with no, early, or intermediate AMD to nvAMD, using color fundus photographs (CFP). Design: Development and validation of a DL algorithm. Methods: We trained a DL algorithm to predict 1-year progression to nvAMD, and used 10-fold cross-validation to evaluate this approach on two groups of eyes in the Age-Related Eye Disease Study (AREDS): none/early/intermediate AMD, and intermediate AMD (iAMD) only. We compared the DL algorithm to the manually graded 4-category and 9-step scales in the AREDS dataset. Main outcome measures: Performance of the DL algorithm was evaluated using the sensitivity at 80% specificity for progression to nvAMD. Results: The DL algorithm's sensitivity for predicting progression to nvAMD from none/early/iAMD (78+/-6%) was higher than manual grades from the 9-step scale (67+/-8%) or the 4-category scale (48+/-3%). For predicting progression specifically from iAMD, the DL algorithm's sensitivity (57+/-6%) was also higher compared to the 9-step grades (36+/-8%) and the 4-category grades (20+/-0%). Conclusions: Our DL algorithm performed better in predicting progression to nvAMD than manual grades. Future investigations are required to test the application of this DL algorithm in a real-world clinical setting.
    Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program
    Dr. Paisan Raumviboonsuk
    Dr. Peranut Chotcomwongse
    Rajiv Raman
    Sonia Phene
    Kornwipa Hemarat
    Mongkol Tadarati
    Sukhum Silpa-Archa
    Jirawut Limwattanayingyong
    Chetan Rao
    Oscar Kuruvilla
    Jesse Jung
    Jeffrey Tan
    Surapong Orprayoon
    Chawawat Kangwanwongpaisan
    Ramase Sukumalpaiboon
    Chainarong Luengchaichawang
    Jitumporn Fuangkaew
    Pipat Kongsap
    Lamyong Chualinpha
    Sarawuth Saree
    Srirut Kawinpanitan
    Korntip Mitvongsa
    Siriporn Lawanasakol
    Chaiyasit Thepchatri
    Lalita Wongpichedchai
    Lily Peng
    npj Digital Medicine (2019)
    Preview abstract Deep learning algorithms have been used to detect diabetic retinopathy (DR) with specialist-level accuracy. This study aims to validate one such algorithm on a large-scale clinical population, and compare the algorithm's performance with that of human graders. A total of 25,326 gradable retinal images of patients with diabetes from the community-based, nationwide DR screening program in Thailand were analyzed for DR severity and referable diabetic macular edema (DME). Grades adjudicated by a panel of international retinal specialists served as the reference standard. Relative to human graders, for detecting referable DR (moderate non-proliferative diabetic retinopathy [NPDR] or worse), the deep learning algorithm had significantly higher sensitivity (0.97 vs. 0.74, p < 0.001) and slightly lower specificity (0.96 vs. 0.98, p < 0.001). Higher sensitivity of the algorithm was also observed for each of the categories of severe or worse NPDR, proliferative DR (PDR), and DME (p < 0.001 for all comparisons). The quadratic-weighted kappa for determination of DR severity levels was 0.85 for the algorithm and 0.78 for human graders (p < 0.001 for the difference). Across different severity levels of DR for determining referable disease, deep learning significantly reduced the false negative rate (by 23%) at the cost of a slightly higher false positive rate (by 2%). Deep learning algorithms may serve as a valuable tool for DR screening. View details
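    The agreement statistic used above, quadratic-weighted kappa, penalizes disagreements by the square of their distance on the severity scale. A minimal sketch with synthetic 5-level grades (the arrays are not the study's data):
```python
# Sketch: quadratic-weighted Cohen's kappa for 5-level DR severity agreement.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(6)
reference = rng.integers(0, 5, size=25326)                      # adjudicated severity (synthetic)
grader = np.clip(reference + rng.integers(-1, 2, size=25326), 0, 4)  # noisy grader

print(cohen_kappa_score(reference, grader, weights="quadratic"))
```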
    Predicting Anemia from Fundus Images
    Akinori Mitani
    Abigail Huang
    Lily Peng
    Avinash Vaidyanathan Varadarajan
    Nature Biomedical Engineering (2019)
    Preview abstract Owing to the invasiveness of diagnostic tests for anaemia and the costs associated with screening for it, the condition is often undetected. Here, we show that anaemia can be detected via machine-learning algorithms trained using retinal fundus images, study participant metadata (including race or ethnicity, age, sex and blood pressure) or the combination of both data types (images and study participant metadata). In a validation dataset of 11,388 study participants from the UK Biobank, the fundus-image-only, metadata-only and combined models predicted haemoglobin concentration (in g dl⁻¹) with mean absolute error values of 0.73 (95% confidence interval: 0.72–0.74), 0.67 (0.66–0.68) and 0.63 (0.62–0.64), respectively, and detected anaemia with areas under the receiver operating characteristic curve (AUC) of 0.74 (0.71–0.76), 0.87 (0.85–0.89) and 0.88 (0.86–0.89), respectively. For 539 study participants with self-reported diabetes, the combined model predicted haemoglobin concentration with a mean absolute error of 0.73 (0.68–0.78) and detected anaemia with an AUC of 0.89 (0.85–0.93). Automated anaemia screening on the basis of fundus images could particularly aid patients with diabetes who undergo regular retinal imaging and for whom anaemia can increase the risks of morbidity and mortality. View details
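    A rough sketch of how an MAE with a bootstrap 95% CI of the kind reported above can be computed; the haemoglobin values and model errors below are synthetic, chosen only to land near the published magnitudes.
```python
# Sketch: mean absolute error with a bootstrap 95% confidence interval.
import numpy as np

rng = np.random.default_rng(7)
hgb_true = rng.normal(14.0, 1.5, size=11388)                 # g/dl, hypothetical cohort
hgb_pred = hgb_true + rng.normal(0.0, 0.8, size=11388)       # hypothetical model output

errors = np.abs(hgb_true - hgb_pred)
boot = [rng.choice(errors, size=errors.size, replace=True).mean() for _ in range(1000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"MAE={errors.mean():.2f} (95% CI: {lo:.2f}-{hi:.2f})")
```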
    Preview abstract Purpose: To present and evaluate a remote, tool-based system and structured grading rubric for adjudicating image-based diabetic retinopathy (DR) grades. Methods: We compared three different procedures for adjudicating DR severity assessments among retina specialist panels, including (1) in-person adjudication based on a previously described procedure (Baseline), (2) remote, tool-based adjudication for assessing DR severity alone (TA), and (3) remote, tool-based adjudication using a feature-based rubric (TA-F). We developed a system allowing graders to review images remotely and asynchronously. For both TA and TA-F approaches, images with disagreement were reviewed by all graders in a round-robin fashion until disagreements were resolved. Five panels of three retina specialists each adjudicated a set of 499 retinal fundus images (1 panel using Baseline, 2 using TA, and 2 using TA-F adjudication). Reliability was measured as grade agreement among the panels using Cohen's quadratically weighted kappa. Efficiency was measured as the number of rounds needed to reach a consensus for tool-based adjudication. Results: The grades from remote, tool-based adjudication showed high agreement with the Baseline procedure, with Cohen's kappa scores of 0.948 and 0.943 for the two TA panels, and 0.921 and 0.963 for the two TA-F panels. Cases adjudicated using TA-F were resolved in fewer rounds compared with TA (P < 0.001; standard permutation test). Conclusions: Remote, tool-based adjudication presents a flexible and reliable alternative to in-person adjudication for DR diagnosis. Feature-based rubrics can help accelerate consensus for tool-based adjudication of DR without compromising label quality. Translational Relevance: This approach can generate reference standards to validate automated methods, and resolve ambiguous diagnoses by integrating into existing telemedical workflows. View details
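    A toy sketch of the round-robin mechanic described above: an image with disagreement keeps circulating among the panel until all grades match, and efficiency is measured in rounds. The `regrade` policy here is a hypothetical stand-in for a grader reviewing peers' grades.
```python
# Sketch: round-robin adjudication until consensus, counting rounds to resolution.
import random

def adjudicate(initial_grades, regrade, max_rounds=10):
    """initial_grades: list of per-grader grades for one image."""
    grades = list(initial_grades)
    for round_num in range(max_rounds):
        if len(set(grades)) == 1:                  # consensus reached
            return grades[0], round_num
        # Each grader reviews the panel's grades and may revise their own.
        grades = [regrade(g, grades) for g in grades]
    return None, max_rounds                        # unresolved within the cap

random.seed(0)
toy_regrade = lambda g, panel: random.choice(panel)   # drift toward the panel's opinions
print(adjudicate([2, 3, 2], toy_regrade))             # (consensus grade, rounds needed)
```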
    Performance of a Deep-Learning Algorithm vs Manual Grading for Detecting Diabetic Retinopathy in India
    Renu P. Rajan
    Derek Wu
    Peter Wubbels
    Tyler Rhodes
    Kira Whitehouse
    Ramasamy Kim
    Rajiv Raman
    Lily Peng
    JAMA Ophthalmology (2019)
    Preview abstract Importance: More than 60 million people in India have diabetes and are at risk for diabetic retinopathy (DR), a vision-threatening disease. Automated interpretation of retinal fundus photographs can help support and scale a robust screening program to detect DR. Objective: To prospectively validate the performance of an automated DR system across 2 sites in India. Design, Setting, and Participants: This prospective observational study was conducted at 2 eye care centers in India (Aravind Eye Hospital and Sankara Nethralaya) and included 3049 patients with diabetes. Data collection and patient enrollment took place between April 2016 and July 2016 at Aravind and between May 2016 and April 2017 at Sankara Nethralaya. The model was trained and fixed in March 2016. Interventions: Automated DR grading system compared with manual grading by 1 trained grader and 1 retina specialist from each site. Adjudication by a panel of 3 retinal specialists served as the reference standard in cases of disagreement. Main Outcomes and Measures: Sensitivity and specificity for moderate or worse DR or referable diabetic macular edema. Results: Of 3049 patients, 1091 (35.8%) were women, and the mean (SD) age was 56.6 (9.0) years at Aravind and 56.0 (10.0) years at Sankara Nethralaya. For moderate or worse DR, the sensitivity and specificity for manual grading by individual nonadjudicator graders ranged from 73.4% to 89.8% and from 83.5% to 98.7%, respectively. The automated DR system's performance equaled or exceeded that of manual grading, with 88.9% sensitivity (95% CI, 85.8-91.5), 92.2% specificity (95% CI, 90.3-93.8), and an area under the curve of 0.963 on the data set from Aravind Eye Hospital, and 92.1% sensitivity (95% CI, 90.1-93.8), 95.2% specificity (95% CI, 94.2-96.1), and an area under the curve of 0.980 on the data set from Sankara Nethralaya. Conclusions and Relevance: This study shows that the automated DR system generalizes to this population of Indian patients in a prospective setting and demonstrates the feasibility of using an automated DR grading system to expand screening programs. View details
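    The sensitivity CIs quoted above are the kind produced by a Wilson score interval for a binomial proportion. A minimal sketch (the counts are hypothetical, picked only to land near the reported 88.9%):
```python
# Sketch: Wilson 95% confidence interval for a sensitivity estimate.
import math

def wilson_ci(successes: int, n: int, z: float = 1.96):
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical counts: true positives out of diseased eyes.
print(wilson_ci(successes=533, n=600))
```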
    Using a deep learning algorithm and integrated gradient explanation to assist grading for diabetic retinopathy
    Ankur Taly
    Anthony Joseph
    Arjun Sood
    Arun Narayanaswamy
    Derek Wu
    Ehsan Rahimy
    Jesse Smith
    Katy Blumer
    Lily Peng
    Michael Shumski
    Scott Barb
    Zahra Rastegar
    Ophthalmology (2019)
    Preview abstract Background: Deep learning methods have recently produced algorithms that can detect diseases such as diabetic retinopathy (DR) with doctor-level accuracy. We sought to understand the impact of these models on physician graders in assisted-read settings. Methods: We surfaced model predictions and explanation maps ("masks") to 9 ophthalmologists with varying levels of experience, each reading 1,804 images for DR severity based on the International Clinical Diabetic Retinopathy (ICDR) disease severity scale. The image sample was representative of the diabetic screening population, and was adjudicated by 3 retina specialists for a reference standard. Doctors read each image in one of 3 conditions: Unassisted, Grades Only, or Grades+Masks. Findings: Readers graded DR more accurately with model assistance than without (p < 0.001, logistic regression). Compared to the adjudicated reference standard, for cases with disease, 5-class accuracy was 57.5% for the model. For graders, 5-class accuracy for cases with disease was 47.5 ± 5.6% unassisted, 56.9 ± 5.5% with Grades Only, and 61.5 ± 5.5% with Grades+Masks. Reader performance improved with assistance across all levels of DR, including severe and proliferative DR. Model assistance increased the accuracy of retina fellows and trainees above that of the unassisted grader or the model alone. Doctors' grading confidence scores and read times both increased overall with assistance. For most cases, Grades+Masks was only as effective as Grades Only, though masks provided additional benefit over grades alone in cases with some DR and low model certainty, low image quality, or proliferative diabetic retinopathy (PDR) with features that were frequently missed, such as panretinal photocoagulation (PRP) scars. Interpretation: Taken together, these results show that deep learning models can improve the accuracy of, and confidence in, DR diagnosis in an assisted-read setting. View details
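    The explanation "masks" above come from integrated gradients. As a rough, framework-free sketch, here is the attribution recipe applied to a toy logistic model on a flattened image; only the path-integral recipe mirrors the method, and everything else (the model, weights, image) is a synthetic stand-in.
```python
# Sketch: integrated gradients for f(x) = sigmoid(w @ x), via a Riemann sum
# over the straight-line path from a baseline image to the input image.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def integrated_gradients(x, baseline, w, steps=50):
    alphas = np.linspace(0.0, 1.0, steps)
    path = baseline + alphas[:, None] * (x - baseline)   # interpolated images
    s = sigmoid(path @ w)
    grads = s * (1 - s)                                  # df/dlogit at each step
    avg_grad = (grads[:, None] * w).mean(axis=0)         # df/dx averaged along path
    return (x - baseline) * avg_grad                     # per-pixel attribution

rng = np.random.default_rng(2)
w = rng.normal(size=64)             # toy "model" weights, one per pixel
x = rng.uniform(size=64)            # toy fundus image, flattened
attributions = integrated_gradients(x, baseline=np.zeros(64), w=w)
print(attributions.round(3))        # reshape to the image grid to render a "mask"
```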
    Preview abstract Purpose To develop and validate a deep learning (DL) algorithm that predicts referable glaucomatous optic neuropathy (GON) and optic nerve head (ONH) features from color fundus images, to determine the relative importance of these features in referral decisions by glaucoma specialists (GSs) and the algorithm, and to compare the performance of the algorithm with eye care providers. Design Development and validation of an algorithm. Participants Fundus images from screening programs, studies, and a glaucoma clinic. Methods A DL algorithm was trained using a retrospective dataset of 86 618 images, assessed for glaucomatous ONH features and referable GON (defined as ONH appearance worrisome enough to justify referral for comprehensive examination) by 43 graders. The algorithm was validated using 3 datasets: dataset A (1205 images, 1 image/patient; 18.1% referable), images adjudicated by panels of GSs; dataset B (9642 images, 1 image/patient; 9.2% referable), images from a diabetic teleretinal screening program; and dataset C (346 images, 1 image/patient; 81.7% referable), images from a glaucoma clinic. Main Outcome Measures The algorithm was evaluated using the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity for referable GON and glaucomatous ONH features. Results The algorithm’s AUC for referable GON was 0.945 (95% confidence interval [CI], 0.929–0.960) in dataset A, 0.855 (95% CI, 0.841–0.870) in dataset B, and 0.881 (95% CI, 0.838–0.918) in dataset C. Algorithm AUCs ranged between 0.661 and 0.973 for glaucomatous ONH features. The algorithm showed significantly higher sensitivity than 7 of 10 graders not involved in determining the reference standard, including 2 of 3 GSs, and showed higher specificity than 3 graders (including 1 GS), while remaining comparable to others. For both GSs and the algorithm, the most crucial features related to referable GON were: presence of vertical cup-to-disc ratio of 0.7 or more, neuroretinal rim notching, retinal nerve fiber layer defect, and bared circumlinear vessels. Conclusions A DL algorithm trained on fundus images alone can detect referable GON with higher sensitivity than and comparable specificity to eye care providers. The algorithm maintained good performance on an independent dataset with diagnoses based on a full glaucoma workup. View details
    Preview abstract Purpose: Use adjudication to quantify errors in diabetic retinopathy (DR) grading based on individual graders and majority decision, and to train an improved automated algorithm for DR grading. Design: Retrospective analysis. Participants: Retinal fundus images from DR screening programs. Methods: Images were each graded by the algorithm, U.S. board-certified ophthalmologists, and retinal specialists. The adjudicated consensus of the retinal specialists served as the reference standard. Main Outcome Measures: For agreement between different graders as well as between the graders and the algorithm, we measured the quadratic-weighted kappa score. To compare the performance of different forms of manual grading and the algorithm for various DR severity cutoffs (e.g., mild or worse DR, moderate or worse DR), we measured area under the curve (AUC), sensitivity, and specificity. Results: Of the 193 discrepancies between adjudication by retinal specialists and the majority decision of ophthalmologists, the most common were missed microaneurysms (MAs) (36%), artifacts (20%), and misclassified hemorrhages (16%). Relative to the reference standard, the kappa ranged from 0.82 to 0.91 for individual retinal specialists and from 0.80 to 0.84 for ophthalmologists, and was 0.84 for the algorithm. For moderate or worse DR, the majority decision of ophthalmologists had a sensitivity of 0.838 and specificity of 0.981; the algorithm had a sensitivity of 0.971, specificity of 0.923, and AUC of 0.986. For mild or worse DR, the algorithm had a sensitivity of 0.970, specificity of 0.917, and AUC of 0.986. By using a small number of adjudicated consensus grades as a tuning dataset and higher-resolution images as input, the algorithm improved in AUC from 0.934 to 0.986 for moderate or worse DR. Conclusions: Adjudication reduces the errors in DR grading. A small set of adjudicated DR grades allows substantial improvements in algorithm performance. The resulting algorithm's performance was on par with that of individual U.S. board-certified ophthalmologists and retinal specialists. View details
    Predicting Cardiovascular Risk Factors in Retinal Fundus Photographs using Deep Learning
    Avinash Vaidyanathan Varadarajan
    Katy Blumer
    Mike McConnell
    Lily Peng
    Nature Biomedical Engineering (2018)
    Preview abstract Traditionally, medical discoveries are made by observing associations, making hypotheses from them and then designing and running experiments to test the hypotheses. However, with medical images, observing and quantifying associations can often be difficult because of the wide variety of features, patterns, colours, values and shapes that are present in real data. Here, we show that deep learning can extract new knowledge from retinal fundus images. Using deep-learning models trained on data from 284,335 patients and validated on two independent datasets of 12,026 and 999 patients, we predicted cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as age (mean absolute error within 3.26 years), gender (area under the receiver operating characteristic curve (AUC) = 0.97), smoking status (AUC = 0.71), systolic blood pressure (mean absolute error within 11.23 mmHg) and major adverse cardiac events (AUC = 0.70). We also show that the trained deep-learning models used anatomical features, such as the optic disc or blood vessels, to generate each prediction. View details
    Preview abstract Objective: Refractive error, one of the leading causes of visual impairment, can be corrected by simple interventions such as prescribing eyeglasses, a process that often starts with autorefraction to estimate the refractive error. In this study, using deep learning, we trained a network to estimate refractive error from fundus photos alone. Design: Retrospective analysis. Subjects, Participants, and/or Controls: Retinal fundus images from participants in the UK Biobank cohort (45-degree field of view) and the AREDS clinical trial (30-degree field of view). Methods, Intervention, or Testing: Refractive error was measured by autorefraction in the UK Biobank dataset and by subjective refraction in the AREDS dataset. We trained a deep learning algorithm to predict refractive error from the fundus photographs and compared the algorithm's predictions with the documented refractive error measurements. Our model used attention to identify the features that are predictive of refractive error. Main Outcome Measures: Mean absolute error (MAE) of the algorithm's prediction compared with the refractive error obtained in the AREDS and UK Biobank. Results: The resulting algorithm had an MAE of 0.56 diopters (95% CI: 0.55-0.56) for estimating spherical equivalent on the UK Biobank dataset and 0.91 diopters (95% CI: 0.89-0.92) on the AREDS dataset. The baseline expected MAE (obtained by simply predicting the mean of the population) was 1.81 diopters (95% CI: 1.79-1.84) for UK Biobank and 1.63 (95% CI: 1.60-1.67) for AREDS. Attention maps suggest that the foveal region is one of the most important areas used by the algorithm to make this prediction, though other regions also contribute. Conclusions: The ability to estimate refractive error with high accuracy from retinal fundus photos was not previously known, and this result demonstrates that deep learning can be applied to make novel predictions from medical images. Given that several groups have recently shown that it is feasible to obtain retinal fundus photos using mobile phones and inexpensive attachments, this work may be particularly relevant in regions of the world where autorefractors are not readily available. View details
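    The "baseline expected MAE" above is what you get by always predicting the population mean. A small synthetic sketch (the spreads below are chosen only to loosely mimic the magnitudes reported for UK Biobank):
```python
# Sketch: predict-the-mean baseline MAE versus a hypothetical model's MAE.
import numpy as np

rng = np.random.default_rng(3)
true_se = rng.normal(loc=-0.3, scale=2.2, size=5000)      # spherical equivalent, diopters
model_pred = true_se + rng.normal(scale=0.7, size=5000)   # hypothetical model error

baseline_mae = np.abs(true_se - true_se.mean()).mean()    # always predict the mean
model_mae = np.abs(true_se - model_pred).mean()
print(f"baseline MAE={baseline_mae:.2f} D, model MAE={model_mae:.2f} D")
```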
    Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs
    Lily Peng
    Martin C Stumpe
    Derek Wu
    Arunachalam Narayanaswamy
    Subhashini Venugopalan
    Tom Madams
    Jorge Cuadros
    Ramasamy Kim
    Rajiv Raman
    Jessica Mega
    JAMA (2016)
    Preview abstract Importance: Deep learning is a family of computational methods that allow an algorithm to program itself by learning from a large set of examples that demonstrate the desired behavior, removing the need to specify rules explicitly. Application of these methods to medical imaging requires further assessment and validation. Objective: To apply deep learning to create an algorithm for automated detection of diabetic retinopathy and diabetic macular edema in retinal fundus photographs. Design and Setting: A specific type of neural network optimized for image classification called a deep convolutional neural network was trained using a retrospective development data set of 128 175 retinal images, which were graded 3 to 7 times for diabetic retinopathy, diabetic macular edema, and image gradability by a panel of 54 US licensed ophthalmologists and ophthalmology senior residents between May and December 2015. The resultant algorithm was validated in January and February 2016 using 2 separate data sets, both graded by at least 7 US board-certified ophthalmologists with high intragrader consistency. Exposure: Deep learning–trained algorithm. Main Outcomes and Measures: The sensitivity and specificity of the algorithm for detecting referable diabetic retinopathy (RDR), defined as moderate and worse diabetic retinopathy, referable diabetic macular edema, or both, were generated based on the reference standard of the majority decision of the ophthalmologist panel. The algorithm was evaluated at 2 operating points selected from the development set, one selected for high specificity and another for high sensitivity. Results: The EyePACS-1 data set consisted of 9963 images from 4997 patients (mean age, 54.4 years; 62.2% women; prevalence of RDR, 683/8878 fully gradable images [7.8%]); the Messidor-2 data set had 1748 images from 874 patients (mean age, 57.6 years; 42.6% women; prevalence of RDR, 254/1745 fully gradable images [14.6%]). For detecting RDR, the algorithm had an area under the receiver operating characteristic curve of 0.991 (95% CI, 0.988-0.993) for EyePACS-1 and 0.990 (95% CI, 0.986-0.995) for Messidor-2. Using the first operating cut point with high specificity, for EyePACS-1, the sensitivity was 90.3% (95% CI, 87.5%-92.7%) and the specificity was 98.1% (95% CI, 97.8%-98.5%). For Messidor-2, the sensitivity was 87.0% (95% CI, 81.1%-91.0%) and the specificity was 98.5% (95% CI, 97.7%-99.1%). Using a second operating point with high sensitivity in the development set, for EyePACS-1 the sensitivity was 97.5% and specificity was 93.4%, and for Messidor-2 the sensitivity was 96.1% and specificity was 93.9%. Conclusions and Relevance: In this evaluation of retinal fundus photographs from adults with diabetes, an algorithm based on deep machine learning had high sensitivity and specificity for detecting referable diabetic retinopathy. Further research is necessary to determine the feasibility of applying this algorithm in the clinical setting and to determine whether use of the algorithm could lead to improved care and outcomes compared with current ophthalmologic assessment. View details
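    Selecting the two operating points described above amounts to scanning the development-set ROC curve for thresholds meeting a specificity or sensitivity target. A sketch on synthetic scores (only the targets mirror the abstract; nothing else is from the study):
```python
# Sketch: pick a high-specificity and a high-sensitivity threshold from an ROC curve.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(4)
y = rng.integers(0, 2, 5000)                                    # 1 = referable DR (synthetic)
scores = np.clip(0.5 * y + rng.normal(0.3, 0.2, 5000), 0, 1)    # hypothetical model scores

fpr, tpr, thr = roc_curve(y, scores)
i_spec = np.argmax(fpr >= 0.02)    # first point where specificity drops to <= 98%
i_sens = np.argmax(tpr >= 0.975)   # first point reaching 97.5% sensitivity
print(f"high-spec point: thr={thr[i_spec]:.2f}, sens={tpr[i_spec]:.2f}, spec={1 - fpr[i_spec]:.2f}")
print(f"high-sens point: thr={thr[i_sens]:.2f}, sens={tpr[i_sens]:.2f}, spec={1 - fpr[i_sens]:.2f}")
```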
    Massively Multitask Networks for Drug Discovery
    Bharath Ramsundar
    Steven Kearnes
    Patrick Riley
    David Konerding
    Vijay Pande
    arXiv:1502.02072 [stat.ML] (2015)
    Preview abstract Massively multitask neural architectures provide a learning framework for drug discovery that synthesizes information from many distinct biological sources. To train these architectures at scale, we gather large amounts of data from public sources to create a dataset of nearly 40 million measurements across more than 200 biological targets. We investigate several aspects of the multitask framework by performing a series of empirical studies and obtain some interesting results: (1) massively multitask networks obtain predictive accuracies significantly better than single-task methods, (2) the predictive power of multitask networks improves as additional tasks and data are added, (3) the total amount of data and the total number of tasks both contribute significantly to multitask improvement, and (4) multitask networks afford limited transferability to tasks not in the training set. Our results underscore the need for greater data sharing and further algorithmic innovation to accelerate the drug discovery process. View details
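    The architecture above is the classic shared-trunk, per-task-head pattern. A minimal forward-pass sketch in numpy (toy sizes with random weights; the paper's networks, fingerprint featurization, and 200+ trained targets are far larger):
```python
# Sketch: one shared hidden layer feeding an independent sigmoid head per task.
import numpy as np

rng = np.random.default_rng(5)
n_features, n_hidden, n_tasks = 1024, 128, 200   # e.g., molecular fingerprints in

W_shared = rng.normal(scale=0.02, size=(n_features, n_hidden))
task_heads = rng.normal(scale=0.02, size=(n_tasks, n_hidden))  # one linear head per target

def forward(x):
    h = np.maximum(x @ W_shared, 0.0)            # shared ReLU trunk
    logits = h @ task_heads.T                    # (batch, n_tasks)
    return 1.0 / (1.0 + np.exp(-logits))         # per-task active/inactive probability

batch = rng.integers(0, 2, size=(32, n_features)).astype(float)  # toy binary fingerprints
print(forward(batch).shape)                      # (32, 200): all tasks scored at once
```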
    Towards better measurement of attention and satisfaction in mobile search
    Dmitry Lagun
    Chih-Hung Hsieh
    SIGIR '14 Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval (2014), pp. 113-122
    Preview abstract Web Search has seen two big changes recently: rapid growth in mobile search traffic, and an increasing trend towards providing answer-like results for relatively simple information needs (e.g., [weather today]). Such results display the answer or relevant information on the search page itself without requiring a user to click. While clicks on organic search results have been used extensively to infer result relevance and search satisfaction, clicks on answer-like results are often rare (or meaningless), making it challenging to evaluate answer quality. Together, these trends call for better measurement and understanding of search satisfaction on mobile devices. In this paper, we studied whether tracking the browser viewport (the visible portion of a web page) on mobile phones could enable accurate measurement of user attention at scale, and provide a good measure of search satisfaction in the absence of clicks. Focusing on answer-like results in web search, we designed a lab study to systematically vary answer presence and relevance (to the user's information need), obtained satisfaction ratings from users, and simultaneously recorded eye gaze and viewport data as users performed search tasks. Using this ground truth, we identified increased scrolling past the answer and increased time spent below the answer as clear, measurable signals of user dissatisfaction with answers. While the viewport may contain three to four results at any given time, we found strong correlations between gaze duration and viewport duration on a per-result basis, and that the average user attention is focused on the top half of the phone screen, suggesting that we may be able to scalably and reliably identify which specific result the user is looking at, from viewport data alone. View details
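    The viewport signal above reduces to crediting each result with the time it is visible on screen. A toy sketch with a hypothetical event log (the event format is illustrative, not the study's actual instrumentation):
```python
# Sketch: per-result dwell time from viewport events.
# Each event: (timestamp in seconds, set of result ids currently inside the viewport).
events = [
    (0.0, {"answer", "r1"}),
    (2.5, {"r1", "r2"}),      # user scrolled past the answer
    (6.0, {"r2", "r3"}),      # time below the answer keeps accruing
    (9.0, set()),             # user left the page
]

dwell = {}
for (t0, visible), (t1, _) in zip(events, events[1:]):
    for result in visible:
        dwell[result] = dwell.get(result, 0.0) + (t1 - t0)

print(dwell)   # e.g., long dwell below the answer can signal dissatisfaction
```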