David Coz
Authored Publications
Race- and Ethnicity-Stratified Analysis of an Artificial Intelligence–Based Tool for Skin Condition Diagnosis by Primary Care Physicians and Nurse Practitioners
David Way
Vishakha Gupta
Yi Gao
Guilherme De Oliveira Marinho
Jay David Hartford
Kimberly Kanada
Clara Eng
Kunal Nagpal
Lily Hao Yi Peng
Carter Dunn
Susan Jen Huang
Peggy Bui
(2022)
Background:
Many dermatologic cases are first evaluated by primary care physicians or nurse practitioners.
Objective:
This study aimed to evaluate an artificial intelligence (AI)-based tool that assists with interpreting dermatologic conditions.
Methods:
We developed an AI-based tool and conducted a randomized multi-reader, multi-case study (20 primary care physicians, 20 nurse practitioners, and 1047 retrospective teledermatology cases) to evaluate its utility. Cases were enriched and comprised 120 skin conditions. Readers were recruited to optimize for geographical diversity; the primary care physicians practiced across 12 states (2-32 years of experience, mean 11.3 years), and the nurse practitioners practiced across 9 states (2-34 years of experience, mean 13.1 years). To avoid memory effects from incomplete washout, each case was read once by each clinician either with or without AI assistance, with the assignment randomized. The primary analyses evaluated the top-1 agreement, defined as the agreement rate of the clinicians’ primary diagnosis with the reference diagnoses provided by a panel of dermatologists (per case: 3 dermatologists from a pool of 12, practicing across 8 states, with 5-13 years of experience, mean 7.2 years of experience). We additionally conducted subgroup analyses stratified by cases’ self-reported race and ethnicity and measured the performance spread: the maximum performance minus the minimum across subgroups.
Results:
The AI’s standalone top-1 agreement was 63%, and AI assistance was significantly associated with higher agreement with reference diagnoses. For primary care physicians, the increase in diagnostic agreement was 10% (P<.001), from 48% to 58%; for nurse practitioners, the increase was 12% (P<.001), from 46% to 58%. When stratified by cases’ self-reported race or ethnicity, the AI’s performance was 59%-62% for Asian, Native Hawaiian, Pacific Islander, other, and Hispanic or Latinx individuals and 67% for both Black or African American and White subgroups. For the clinicians, AI assistance–associated improvements across subgroups were in the range of 8%-12% for primary care physicians and 8%-15% for nurse practitioners. The performance spread across subgroups was 5.3% unassisted vs 6.6% assisted for primary care physicians and 5.2% unassisted vs 6.0% assisted for nurse practitioners. In both unassisted and AI-assisted modalities, and for both primary care physicians and nurse practitioners, the subgroup with the highest performance on average was Black or African American individuals, though the differences with other subgroups were small and had overlapping 95% CIs.
Conclusions:
AI assistance was associated with significantly improved diagnostic agreement with dermatologists. Across race and ethnicity subgroups, for both primary care physicians and nurse practitioners, the effect of AI assistance remained high at 8%-15%, and the performance spread was similar at 5%-7%.
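The two metrics defined in the Methods of this abstract, top-1 agreement and performance spread, can be illustrated with a minimal sketch. The record layout (`subgroup`, `primary`, `reference`) and the toy reads are hypothetical and are not study data; treating the reference diagnoses as a set that the primary diagnosis must fall into is an assumption, not the authors' documented scoring rule.

```python
from collections import defaultdict


def top1_agreement(reads):
    """Fraction of reads whose primary diagnosis is among the reference diagnoses."""
    hits = sum(1 for r in reads if r["primary"] in r["reference"])
    return hits / len(reads)


def performance_spread(reads):
    """Return (max minus min top-1 agreement across subgroups, per-subgroup rates)."""
    by_group = defaultdict(list)
    for r in reads:
        by_group[r["subgroup"]].append(r)
    rates = {g: top1_agreement(rs) for g, rs in by_group.items()}
    return max(rates.values()) - min(rates.values()), rates


# Toy, made-up reads for illustration only (hypothetical field names).
reads = [
    {"subgroup": "A", "primary": "eczema", "reference": {"eczema"}},
    {"subgroup": "A", "primary": "psoriasis", "reference": {"eczema"}},
    {"subgroup": "B", "primary": "acne", "reference": {"acne", "rosacea"}},
    {"subgroup": "B", "primary": "acne", "reference": {"acne"}},
]

overall = top1_agreement(reads)
spread, rates = performance_spread(reads)
print(overall, rates, spread)
```

Computing the spread separately for the unassisted and assisted reads would give the kind of paired comparison reported in the Results.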
View details
Development and Assessment of an Artificial Intelligence–Based Tool for Skin Condition Diagnosis by Primary Care Physicians and Nurse Practitioners in Teledermatology Practices
David Way
Vishakha Gupta
Yi Gao
Guilherme De Oliveira Marinho
Jay David Hartford
Kimberly Kanada
Clara Eng
Kunal Nagpal
Lily Hao Yi Peng
Carter Dunn
Susan Jen Huang
Peggy Bui
JAMA Network Open (2021)
Importance: Most dermatologic cases are initially evaluated by nondermatologists such as primary care physicians (PCPs) or nurse practitioners (NPs).
Objective: To evaluate an artificial intelligence (AI)–based tool that assists with diagnoses of dermatologic conditions.
Design, Setting, and Participants: This multiple-reader, multiple-case diagnostic study developed an AI-based tool and evaluated its utility. Primary care physicians and NPs retrospectively reviewed an enriched set of cases representing 120 different skin conditions. Randomization was used to ensure each clinician reviewed each case either with or without AI assistance; each clinician alternated between batches of 50 cases in each modality. The reviews occurred from February 21 to April 28, 2020. Data were analyzed from May 26, 2020, to January 27, 2021.
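One way to picture the design described above, in which each clinician reads every case once and alternates between assisted and unassisted batches of 50, is the sketch below. The abstract does not give the exact randomization procedure, so the shuffling, the coin flip for the starting modality, and the function name `assign_modalities` are all assumptions made for illustration.

```python
import random


def assign_modalities(case_ids, batch_size=50, seed=0):
    """Shuffle cases for one clinician and alternate modality per batch.

    Assumption: the starting modality is chosen by a coin flip and then
    alternates every `batch_size` cases. The study's actual scheme may differ.
    """
    rng = random.Random(seed)
    order = list(case_ids)
    rng.shuffle(order)
    first_assisted = rng.random() < 0.5
    assignment = {}
    for i, case_id in enumerate(order):
        batch_index = i // batch_size
        # Even-numbered batches use the starting modality, odd ones the other.
        assisted = (batch_index % 2 == 0) == first_assisted
        assignment[case_id] = "assisted" if assisted else "unassisted"
    return order, assignment


# Example: one hypothetical clinician reading 1048 cases.
order, assignment = assign_modalities(range(1048), seed=42)
```

Using a different seed per clinician would give each reader a different case order and a different split of cases between modalities.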
Exposures: An AI-based assistive tool for interpreting clinical images and associated medical history.
Main Outcomes and Measures: The primary analysis evaluated agreement with reference diagnoses provided by a panel of 3 dermatologists for PCPs and NPs. Secondary analyses included diagnostic accuracy for biopsy-confirmed cases, biopsy and referral rates, review time, and diagnostic confidence.
Results: Forty board-certified clinicians, including 20 PCPs (14 women [70.0%]; mean experience, 11.3 [range, 2-32] years) and 20 NPs (18 women [90.0%]; mean experience, 13.1 [range, 2-34] years) reviewed 1048 retrospective cases (672 female [64.2%]; median age, 43 [interquartile range, 30-56] years; 41 920 total reviews) from a teledermatology practice serving 11 sites and provided 0 to 5 differential diagnoses per case (mean [SD], 1.6 [0.7]). The PCPs were located across 12 states, and the NPs practiced in primary care without physician supervision across 9 states. Artificial intelligence assistance was significantly associated with higher agreement with reference diagnoses. For PCPs, the increase in diagnostic agreement was 10% (95% CI, 8%-11%; P < .001), from 48% to 58%; for NPs, the increase was 12% (95% CI, 10%-14%; P < .001), from 46% to 58%. In secondary analyses, agreement with biopsy-obtained diagnosis categories of malignant, precancerous, or benign increased by 3% (95% CI, −1% to 7%) for PCPs and by 8% (95% CI, 3%-13%) for NPs. Rates of desire for biopsies decreased by 1% (95% CI, 0%-3%) for PCPs and 2% (95% CI, 1%-3%) for NPs; the rate of desire for referrals decreased by 3% (95% CI, 1%-4%) for PCPs and NPs. Diagnostic agreement on cases not indicated for a dermatologist referral increased by 10% (95% CI, 8%-12%) for PCPs and 12% (95% CI, 10%-14%) for NPs, and median review time increased slightly by 5 (95% CI, 0-8) seconds for PCPs and 7 (95% CI, 5-10) seconds for NPs per case.
Conclusions and Relevance: Artificial intelligence assistance was associated with improved diagnoses by PCPs and NPs for 1 in every 8 to 10 cases, indicating potential for improving the quality of dermatologic care.
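The "1 in every 8 to 10 cases" figure appears to follow from the reported absolute increases in agreement (12% for NPs, 10% for PCPs). Taking the reciprocal of each increase, which is an interpretation rather than a calculation stated in the abstract, gives:

```latex
\[
\frac{1}{0.12} \approx 8.3 \ \text{cases per improved diagnosis (NPs)},
\qquad
\frac{1}{0.10} = 10 \ \text{cases per improved diagnosis (PCPs)}
\]
```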
View details
Agreement Between Saliency Maps and Human-Labeled Regions of Interest: Applications to Skin Disease Classification
Nalini Singh
Kang Lee
Susan Huang
Aaron Loh
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2020)
We propose to systematically identify potentially problematic patterns in skin disease classification models via quantitative analysis of agreement between saliency maps and human-labeled regions of interest. We further compute summary statistics describing patterns in this agreement for various stratifications of input examples. Through this analysis, we discover candidate spurious associations learned by the classifier and suggest next steps to handle such associations. Our approach can be used as a debugging tool to systematically spot difficult examples and error categories. Insights from this analysis could guide targeted data collection and improve model generalizability.
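As a rough illustration of quantifying agreement between a saliency map and a human-labeled region of interest, the sketch below thresholds the saliency map and computes intersection over union (IoU) with a binary ROI mask. The abstract does not name the agreement metric, so IoU, the 0.5 threshold, and the function name `saliency_roi_iou` are assumptions chosen for this example.

```python
def saliency_roi_iou(saliency, roi_mask, threshold=0.5):
    """IoU between the thresholded saliency map and a binary ROI mask.

    Both inputs are 2D lists of equal shape. Saliency values are assumed
    to be normalized to [0, 1]; the threshold is a hypothetical choice.
    """
    intersection = 0
    union = 0
    for sal_row, roi_row in zip(saliency, roi_mask):
        for s, m in zip(sal_row, roi_row):
            salient = s >= threshold
            inside = bool(m)
            intersection += salient and inside
            union += salient or inside
    return intersection / union if union else 0.0


# Toy 3x3 example (made-up values, not from the paper).
saliency = [
    [0.9, 0.2, 0.1],
    [0.8, 0.7, 0.0],
    [0.1, 0.6, 0.0],
]
roi_mask = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
]
iou = saliency_roi_iou(saliency, roi_mask)
print(iou)
```

Averaging such a score within each stratum of examples (for instance, by predicted condition) is one way to produce the kind of summary statistics the abstract mentions.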
View details
A deep learning system for differential diagnosis of skin diseases
Clara Eng
David Way
Kang Lee
Peggy Bui
Kimberly Kanada
Guilherme de Oliveira Marinho
Jess Gallegos
Sara Gabriele
Vishakha Gupta
Nalini Singh
Lily Peng
Dennis Ai
Susan Huang
Carter Dunn
Nature Medicine (2020)
Skin conditions affect 1.9 billion people. Because of a shortage of dermatologists, most cases are seen instead by general practitioners with lower diagnostic accuracy. We present a deep learning system (DLS) to provide a differential diagnosis of skin conditions using 16,114 de-identified cases (photographs and clinical data) from a teledermatology practice serving 17 sites. The DLS distinguishes between 26 common skin conditions, representing 80% of cases seen in primary care, while also providing a secondary prediction covering 419 skin conditions. On 963 validation cases, where a rotating panel of three board-certified dermatologists defined the reference standard, the DLS was non-inferior to six other dermatologists and superior to six primary care physicians (PCPs) and six nurse practitioners (NPs) (top-1 accuracy: 0.66 DLS, 0.63 dermatologists, 0.44 PCPs and 0.40 NPs). These results highlight the potential of the DLS to assist general practitioners in diagnosing skin conditions.
View details
Using a deep learning algorithm and integrated gradient explanation to assist grading for diabetic retinopathy
Ankur Taly
Anthony Joseph
Arjun Sood
Arun Narayanaswamy
Derek Wu
Ehsan Rahimy
Jesse Smith
Katy Blumer
Lily Peng
Michael Shumski
Scott Barb
Zahra Rastegar
Ophthalmology (2019)
Background: Deep learning methods have recently produced algorithms that can detect diseases such as diabetic retinopathy (DR) with doctor-level accuracy. We sought to understand the impact of these models on physician graders in assisted-read settings.
Methods: We surfaced model predictions and explanation maps ("masks") to 9 ophthalmologists with varying levels of experience to read 1,804 images each for DR severity based on the International Clinical Diabetic Retinopathy (ICDR) disease severity scale. The image sample was representative of the diabetic screening population, and was adjudicated by 3 retina specialists for a reference standard. Doctors read each image in one of 3 conditions: Unassisted, Grades Only, or Grades+Masks.
Findings: Readers graded DR more accurately with model assistance than without (p < 0.001, logistic regression). Compared to the adjudicated reference standard, for cases with disease, 5-class accuracy was 57.5% for the model. For graders, 5-class accuracy for cases with disease was 47.5 ± 5.6% unassisted, 56.9 ± 5.5% with Grades Only, and 61.5 ± 5.5% with Grades+Masks. Reader performance improved with assistance across all levels of DR, including for severe and proliferative DR. Model assistance increased the accuracy of retina fellows and trainees above that of the unassisted grader or model alone. Doctors’ grading confidence scores and read times both increased overall with assistance. For most cases, Grades+Masks was only as effective as Grades Only, though masks provided additional benefit over grades alone in cases with: some DR and low model certainty; low image quality; and proliferative diabetic retinopathy (PDR) with features that were frequently missed, such as panretinal photocoagulation (PRP) scars.
Interpretation: Taken together, these results show that deep learning models can improve the accuracy of, and confidence in, DR diagnosis in an assisted read setting.
View details
Preview abstract
Despite the recent success in applying supervised deep learning to medical imaging tasks, the problem of obtaining the large, diverse, and expert-annotated datasets required to develop high-performing models remains particularly challenging. In this work, we explore the possibility of using Generative Adversarial Networks (GANs) to synthesize natural images with skin pathology. We propose DermGAN, an adaptation of the popular Pix2Pix architecture, to create synthetic images for a pre-specified skin condition with varying size, location, and underlying skin color. In a human Turing test, we show that the synthetic images are not only visually similar to real images, but also embody the respective skin condition in dermatologists' eyes. Furthermore, when using synthetic images as a data augmentation technique for training a skin condition classifier, the model is non-inferior to the baseline model while demonstrating improved performance for rare conditions.
View details