
Jonathan Krause
Research Areas
Authored Publications
Safety principles for medical summarization using generative AI
Christopher Kelly
Chris Farrance
Praney Mittal
Donny Cheung
Nicola Ding
Heather Cole-Lewis
Madeleine Elish
Dillon Obika
Nature Medicine (2024)
The introduction of generative AI, particularly large language models (LLMs), presents exciting opportunities for healthcare. However, their novel capabilities also have the potential to introduce novel risks and hazards. This paper explores the unique safety challenges associated with LLMs in healthcare, using medical text summarization as a motivating example. Using MedLM as a case example, we propose leveraging existing standards and guidance while developing novel approaches tailored to the specific characteristics of LLMs.
Lessons learned from translating AI from development to deployment in healthcare
Lily Peng
Elin Rønby Pedersen
Divleen Jeji
Sunny Virmani
Jay Nayar
Nature Medicine (2023)
The application of an artificial intelligence (AI)-based screening tool for retinal disease in India and Thailand highlighted the myths and reality of introducing medical AI, which may form a framework for subsequent tools.
A deep learning model for novel systemic biomarkers in photographs of the external eye: a retrospective study
April Y. Maa
Ramasamy Kim
Lauren P. Daskivich
Eugene Yu-Chuan Kang
Jorge Cuadros
Lily Peng
Christina Chen
Ilana Traynis
Akib Uddin
Avinash Varadarajan
The Lancet Digital Health (2023)
Background
Photographs of the external eye were recently shown to reveal signs of diabetic retinal disease and elevated glycated haemoglobin. This study aimed to test the hypothesis that external eye photographs contain information about additional systemic medical conditions.
Methods
We developed a deep learning system (DLS) that takes external eye photographs as input and predicts systemic parameters, such as those related to the liver (albumin, aspartate aminotransferase [AST]); kidney (estimated glomerular filtration rate [eGFR], urine albumin-to-creatinine ratio [ACR]); bone or mineral (calcium); thyroid (thyroid stimulating hormone); and blood (haemoglobin, white blood cells [WBC], platelets). This DLS was trained using 123 130 images from 38 398 patients with diabetes undergoing diabetic eye screening in 11 sites across Los Angeles county, CA, USA. Evaluation focused on nine prespecified systemic parameters and leveraged three validation sets (A, B, C) spanning 25 510 patients with and without diabetes undergoing eye screening in three independent sites in Los Angeles county, CA, and the greater Atlanta area, GA, USA. We compared performance against baseline models incorporating available clinicodemographic variables (eg, age, sex, race and ethnicity, years with diabetes).
Findings
Relative to the baseline, the DLS achieved statistically significant superior performance at detecting AST >36.0 U/L, calcium <8.6 mg/dL, eGFR <60.0 mL/min/1.73 m², haemoglobin <11.0 g/dL, platelets <150.0 × 10³/μL, ACR ≥300 mg/g, and WBC <4.0 × 10³/μL on validation set A (a population resembling the development datasets), with the area under the receiver operating characteristic curve (AUC) of the DLS exceeding that of the baseline by 5.3–19.9% (absolute differences in AUC). On validation sets B and C, with substantial patient population differences compared with the development datasets, the DLS outperformed the baseline for ACR ≥300.0 mg/g and haemoglobin <11.0 g/dL by 7.3–13.2%.
Interpretation
We found further evidence that external eye photographs contain biomarkers spanning multiple organ systems. Such biomarkers could enable accessible and non-invasive screening of disease. Further work is needed to understand the translational implications.
Domain-specific optimization and diverse evaluation of self-supervised models for histopathology
Cameron Chen
Jessica Loo
Fayaz Jamil
Supriya Vijay
Jeremy Lai
Saloni Agarwal
Saurabh Vyawahare
arXiv (2023)
Task-specific deep learning models in histopathology offer promising opportunities for improving diagnosis, clinical research, and precision medicine. However, development of such models is often limited by availability of high-quality data. Foundation models in histopathology that learn general representations across a wide range of tissue types, diagnoses, and magnifications offer the potential to reduce the data, compute, and technical expertise necessary to develop task-specific deep learning models with the required level of model performance. In this work, we describe the development and evaluation of foundation models for histopathology via self-supervised learning (SSL). We first establish a diverse set of benchmark tasks involving 17 unique tissue types and 12 unique cancer types and spanning different optimal magnifications and task types. Next, we use this benchmark to explore and evaluate histopathology-specific SSL methods followed by further evaluation on held out patch-level and weakly supervised tasks. We found that standard SSL methods thoughtfully applied to histopathology images are performant across our benchmark tasks and that domain-specific methodological improvements can further increase performance. Our findings reinforce the value of using domain-specific SSL methods in pathology, and establish a set of high quality foundation models to enable further research across diverse applications.
Performance of a Diabetic Retinopathy Artificial Intelligence Algorithm for Ultra-widefield Imaging
Barba Hamill
Lloyd Aiello
Jerry Cavallerano
Srinivas R Sadda
Drew Lewis
Sara Ellen Godek
Dana Keane
Tunde Peto
Anne Marie Cairns
Lily Peng
Lu Yang
Naho Kitade
Kira Whitehouse
Sunny Virmani
ARVO (2022)
Purpose: To evaluate the performance of a deep learning model for diabetic retinopathy (DR) and diabetic macular edema screening when using ultra-widefield (UWF) imaging.
Methods: For model development, 67,200 UWF images were collected from DR programs and ophthalmology clinics worldwide. Of these, 30,836 images were double graded and adjudicated at 8 grading centres by 125 certified graders using the ETDRS extension of the Modified Airlie House Classification of Diabetic Retinopathy following the JVN Clinical Trial Ultrawide Field Grading Manual v1.0. The grading system used the traditional ETDRS 7-SF field definition as well as extended fields 3-7 to evaluate the retinal periphery. A further 36,364 UWF images were graded using a grading protocol based on the ICDR classification. The dataset was split into training, tuning, and testing sets. The final DR model is an ensemble of 10 EfficientNet-b0 neural networks, independently trained with standard image augmentation techniques. For model validation, two independent sets of images were collected. Model performance was evaluated by comparing its predictions to the adjudicated ground truth for both sets of images.
Results: Prior to clinical validation, the model performance was internally evaluated on an independent set of 1967 images, of which 1050 were graded via adjudication as negative for more than mild diabetic retinopathy (mtmDR negative), and 917 as having referable diabetic retinopathy (mtmDR positive). The overall performance (Table 1) was weighted by target DR distribution. Clinical validation evaluated an independent data set of 420 images selected to achieve a target distribution that enabled appropriate confidence intervals for mtmDR sensitivity and specificity. A panel of three graders adjudicated these 420 images and assessed 241 as mtmDR negative, 179 as mtmDR positive and 135 as vtDR positive. The model’s performance on the clinical validation set is shown in Table 2.
Conclusions: The deep learning model was developed with high quality graded UWF images and performed at a level that highly suggests usefulness in a clinical screening setting. A large, prospective multi-center clinical trial is currently evaluating the performance of a similar model in a real-world clinical setting.
This abstract was presented at the 2022 ARVO Annual Meeting, held in Denver, CO, May 1-4, 2022, and virtually.
Longitudinal Screening for Diabetic Retinopathy in a Nationwide Screening Program: Comparing Deep Learning and Human Graders
Chetan Rao
Varis Ruamviboonsuk
Rajiv Raman
Jirawut Limwattanayingyong
Ngamphol Soonthornworasiri
Andrzej Grzybowski
Richa Tiwari, PhD
Variya Nganthavee
Kasem Seresirikachorn
Tassapol Singalavanija
Dr. Paisan Raumviboonsuk
Fred Hersch
Lily Hao Yi Peng
Journal of Diabetes Research (2020)
Objective.
To evaluate diabetic retinopathy (DR) screening via deep learning (DL) and trained human graders (HG) in a longitudinal cohort, as case spectrum shifts based on treatment referral and new-onset DR.
Methods.
We randomly selected patients with diabetes screened twice, two years apart within a nationwide screening program. The reference standard was established via adjudication by retina specialists. Each patient’s color fundus photographs were graded, and a patient was considered as having sight-threatening DR (STDR) if the worse eye had severe nonproliferative DR, proliferative DR, or diabetic macular edema. We compared DR screening via two modalities: DL and HG. For each modality, we simulated treatment referral by excluding patients with detected STDR from the second screening using that modality.
Results.
There were 5,738 patients (12.3% STDR) in the first screening. DL and HG captured different numbers of STDR cases, and after simulated referral and excluding ungradable cases, 4,148 and 4,263 patients remained in the second screening, respectively. The STDR prevalence at the second screening was 5.1% and 6.8% for DL- and HG-based screening, respectively. Along with the prevalence decrease, the sensitivity for both modalities decreased from the first to the second screening (DL: from 95% to 90%, p=0.008; HG: from 74% to 57%, p<0.001). At both the first and second screenings, the rate of false negatives for DL was about a fifth that of HG (0.5-0.6% vs. 2.9-3.2%).
Conclusion.
On 2-year longitudinal follow-up of a DR screening cohort, STDR prevalence decreased for both DL- and HG-based screening. Follow-up screenings in longitudinal DR screening can be more difficult and induce lower sensitivity for both DL and HG, though the false negative rate was substantially lower for DL. Our data may be useful for health-economics analyses of longitudinal screening settings.
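The reported false-negative rates can be roughly reproduced from the prevalence and sensitivity figures in the Results. The sketch below assumes that "rate of false negatives" means the share of all screened patients who have STDR but are missed, i.e. prevalence × (1 − sensitivity). That reading is an assumption, not something the abstract states, and the function and variable names are illustrative only.

```python
# Prevalence and sensitivity values as reported in the abstract.
reported = {
    ("DL", "first"): {"prevalence": 0.123, "sensitivity": 0.95},
    ("HG", "first"): {"prevalence": 0.123, "sensitivity": 0.74},
    ("DL", "second"): {"prevalence": 0.051, "sensitivity": 0.90},
    ("HG", "second"): {"prevalence": 0.068, "sensitivity": 0.57},
}


def false_negative_share(prevalence, sensitivity):
    """Assumed definition: fraction of all screened patients with missed STDR."""
    return prevalence * (1.0 - sensitivity)


fn_share = {key: false_negative_share(**vals) for key, vals in reported.items()}

for (modality, screening), share in fn_share.items():
    print(f"{modality} {screening} screening: {100 * share:.1f}% of screened patients")
```

Under this assumption the values come out at about 0.6% and 0.5% for DL and about 3.2% and 2.9% for HG, in line with the ranges quoted above.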
Using a deep learning algorithm and integrated gradient explanation to assist grading for diabetic retinopathy
Arjun Sood
Michael Shumski
Anthony Joseph
Scott Barb
Ehsan Rahimy
Jesse Smith
Derek Wu
Arun Narayanaswamy
Lily Peng
Katy Blumer
Ankur Taly
Zahra Rastegar
Ophthalmology (2019)
Background Deep learning methods have recently produced algorithms that can detect disease such as diabetic retinopathy (DR) with doctor-level accuracy. We sought to understand the impact of these models on physician graders in assisted-read settings.
Methods We surfaced model predictions and explanation maps ("masks") to 9 ophthalmologists with varying levels of experience to read 1,804 images each for DR severity based on the International Clinical Diabetic Retinopathy (ICDR) disease severity scale. The image sample was representative of the diabetic screening population, and was adjudicated by 3 retina specialists for a reference standard. Doctors read each image in one of 3 conditions: Unassisted, Grades Only, or Grades+Masks.
Findings Readers graded DR more accurately with model assistance than without (p < 0.001, logistic regression). Compared to the adjudicated reference standard, for cases with disease, 5-class accuracy was 57.5% for the model. For graders, 5-class accuracy for cases with disease was 47.5 ± 5.6% unassisted, 56.9 ± 5.5% with Grades Only, and 61.5 ± 5.5% with Grades+Masks. Reader performance improved with assistance across all levels of DR, including for severe and proliferative DR. Model assistance increased the accuracy of retina fellows and trainees above that of the unassisted grader or model alone. Doctors’ grading confidence scores and read times both increased overall with assistance. For most cases, Grades+Masks was only as effective as Grades Only, though masks provided additional benefit over grades alone in cases with: some DR and low model certainty; low image quality; and proliferative diabetic retinopathy (PDR) with features that were frequently missed, such as panretinal photocoagulation (PRP) scars.
Interpretation Taken together, these results show that deep learning models can improve the accuracy of, and confidence in, DR diagnosis in an assisted read setting.
Deep Learning and Glaucoma Specialists: The Relative Importance of Optic Disc Features to Predict Glaucoma Referral in Fundus Photographs
Felipe Medeiros
April Maa
Arielle Spitze
Monica Gandhi
Derek Wu
Lily Peng
Naho Kitade
Ashish Bora
Sonia Phene
Anita Misra
Abigail Huang
Carter Dunn
Ophthalmology (2019)
Purpose
To develop and validate a deep learning (DL) algorithm that predicts referable glaucomatous optic neuropathy (GON) and optic nerve head (ONH) features from color fundus images, to determine the relative importance of these features in referral decisions by glaucoma specialists (GSs) and the algorithm, and to compare the performance of the algorithm with eye care providers.
Design
Development and validation of an algorithm.
Participants
Fundus images from screening programs, studies, and a glaucoma clinic.
Methods
A DL algorithm was trained using a retrospective dataset of 86 618 images, assessed for glaucomatous ONH features and referable GON (defined as ONH appearance worrisome enough to justify referral for comprehensive examination) by 43 graders. The algorithm was validated using 3 datasets: dataset A (1205 images, 1 image/patient; 18.1% referable), images adjudicated by panels of GSs; dataset B (9642 images, 1 image/patient; 9.2% referable), images from a diabetic teleretinal screening program; and dataset C (346 images, 1 image/patient; 81.7% referable), images from a glaucoma clinic.
Main Outcome Measures
The algorithm was evaluated using the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity for referable GON and glaucomatous ONH features.
Results
The algorithm’s AUC for referable GON was 0.945 (95% confidence interval [CI], 0.929–0.960) in dataset A, 0.855 (95% CI, 0.841–0.870) in dataset B, and 0.881 (95% CI, 0.838–0.918) in dataset C. Algorithm AUCs ranged between 0.661 and 0.973 for glaucomatous ONH features. The algorithm showed significantly higher sensitivity than 7 of 10 graders not involved in determining the reference standard, including 2 of 3 GSs, and showed higher specificity than 3 graders (including 1 GS), while remaining comparable to others. For both GSs and the algorithm, the most crucial features related to referable GON were: presence of vertical cup-to-disc ratio of 0.7 or more, neuroretinal rim notching, retinal nerve fiber layer defect, and bared circumlinear vessels.
Conclusions
A DL algorithm trained on fundus images alone can detect referable GON with higher sensitivity than and comparable specificity to eye care providers. The algorithm maintained good performance on an independent dataset with diagnoses based on a full glaucoma workup.
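For readers less familiar with the outcome measures named above, the following is a minimal sketch of how AUC, sensitivity, and specificity can be computed for a binary referral task. It is not the study's evaluation code. The function names, the toy labels and scores, and the 0.5 threshold are all hypothetical. AUC is computed here as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold count as referable."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: 1 = referable, 0 = not referable.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(toy_labels, toy_scores))  # 0.75
print(sensitivity_specificity(toy_labels, toy_scores, threshold=0.5))  # (0.5, 1.0)
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration. A rank-based implementation would be preferable for datasets the size of those described here.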
Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program
Rajiv Raman
Siriporn Lawanasakol
Sukhum Silpa-Archa
Chetan Rao
Surapong Orprayoon
Dr. Paisan Raumviboonsuk
Srirut Kawinpanitan
Kornwipa Hemarat
Jitumporn Fuangkaew
Jirawut Limwattanayingyong
Pipat Kongsap
Chainarong Luengchaichawang
Sarawuth Saree
Chaiyasit Thepchatri
Jesse Jung
Lalita Wongpichedchai
Jeffrey Tan
Korntip Mitvongsa
Oscar Kuruvilla
Chawawat Kangwanwongpaisan
Mongkol Tadarati
Lamyong Chualinpha
Ramase Sukumalpaiboon
Dr. Peranut Chotcomwongse
Lily Peng
Sonia Phene
Nature Partner Journal (npj) Digital Medicine (2019)
Deep learning algorithms have been used to detect diabetic retinopathy (DR) with specialist-level accuracy. This study aims to validate one such algorithm on a large-scale clinical population, and compare the algorithm performance with that of human graders. A total of 25,326 gradable retinal images of patients with diabetes from the community-based, nationwide screening program of DR in Thailand were analyzed for DR severity and referable diabetic macular edema (DME). Grades adjudicated by a panel of international retinal specialists served as the reference standard. Relative to human graders, for detecting referable DR (moderate NPDR or worse), the deep learning algorithm had significantly higher sensitivity (0.97 vs. 0.74, p < 0.001), and a slightly lower specificity (0.96 vs. 0.98, p < 0.001). Higher sensitivity of the algorithm was also observed for each of the categories of severe or worse NPDR, PDR, and DME (p < 0.001 for all comparisons). The quadratic-weighted kappa for determination of DR severity levels by the algorithm and human graders was 0.85 and 0.78 respectively (p < 0.001 for the difference). Across different severity levels of DR for determining referable disease, deep learning significantly reduced the false negative rate (by 23%) at the cost of slightly higher false positive rates (2%). Deep learning algorithms may serve as a valuable tool for DR screening.
Remote Tool-based Adjudication for Grading Diabetic Retinopathy
Edith Law
Brian Basham
Lily Peng
Xiang Ji
Tayyeba Ali
Will Chen
Translational Vision Science & Technology (TVST) (2019)
Purpose: To present and evaluate a remote, tool-based system and structured grading rubric for adjudicating image-based diabetic retinopathy (DR) grades.
Methods: We compared three different procedures for adjudicating DR severity assessments among retina specialist panels, including (1) in-person adjudication based on a previously described procedure (Baseline), (2) remote, tool-based adjudication for assessing DR severity alone (TA), and (3) remote, tool-based adjudication using a feature-based rubric (TA-F). We developed a system allowing graders to review images remotely and asynchronously. For both TA and TA-F approaches, images with disagreement were reviewed by all graders in a round-robin fashion until disagreements were resolved. Five panels of three retina specialists each adjudicated a set of 499 retinal fundus images (1 panel using Baseline, 2 using TA, and 2 using TA-F adjudication). Reliability was measured as grade agreement among the panels using Cohen's quadratically weighted kappa. Efficiency was measured as the number of rounds needed to reach a consensus for tool-based adjudication.
Results: The grades from remote, tool-based adjudication showed high agreement with the Baseline procedure, with Cohen's kappa scores of 0.948 and 0.943 for the two TA panels, and 0.921 and 0.963 for the two TA-F panels. Cases adjudicated using TA-F were resolved in fewer rounds compared with TA (P < 0.001; standard permutation test).
Conclusions: Remote, tool-based adjudication presents a flexible and reliable alternative to in-person adjudication for DR diagnosis. Feature-based rubrics can help accelerate consensus for tool-based adjudication of DR without compromising label quality.
Translational Relevance: This approach can generate reference standards to validate automated methods, and resolve ambiguous diagnoses by integrating into existing telemedical workflows.
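The reliability metric used above, Cohen's quadratically weighted kappa, can be sketched in a few lines. This is a generic textbook formulation, not the authors' code. It assumes grades are integers from 0 to `num_classes - 1` (for example, a five-level DR severity scale), and the function name and example grades are hypothetical.

```python
def quadratic_weighted_kappa(grades_a, grades_b, num_classes):
    """Cohen's kappa with quadratic disagreement weights for two sets of grades."""
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("grade lists must be non-empty and of equal length")
    n = len(grades_a)
    k = num_classes

    # Observed confusion matrix between the two sets of grades.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(grades_a, grades_b):
        observed[a][b] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Penalty grows with the squared distance between grades.
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = row_totals[i] * col_totals[j] / n  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all grades fall in one class")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical grades from two panels on the same six images.
panel_1 = [0, 1, 2, 3, 4, 2]
panel_2 = [0, 1, 2, 3, 3, 2]
print(quadratic_weighted_kappa(panel_1, panel_2, num_classes=5))
```

Because the weights are quadratic, a disagreement of several severity levels is penalized far more than a one-level disagreement, which suits ordinal scales such as DR severity.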