Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program

Dr. Paisan Raumviboonsuk

Jonathan Krause

Dr. Peranut Chotcomwongse

Rory Abbott Sayres

Rajiv Raman

Kasumi Widner

Bilson Campana

Sonia Phene

Kornwipa Hemarat

Mongkol Tadarati

Sukhum Silpa-Archa

Jirawut Limwattanayingyong

Chetan Rao

Oscar Kuruvilla

Jesse Jung

Jeffrey Tan

Surapong Orprayoon

Chawawat Kangwanwongpaisan

Ramase Sukumalpaiboon

Chainarong Luengchaichawang

Jitumporn Fuangkaew

Pipat Kongsap

Lamyong Chualinpha

Sarawuth Saree

Srirut Kawinpanitan

Korntip Mitvongsa

Siriporn Lawanasakol

Chaiyasit Thepchatri

Lalita Wongpichedchai

Greg Corrado

Lily Peng

Dale Webster

Nature Partner Journal (npj) Digital Medicine (2019)

Abstract

Deep learning algorithms have been used to detect diabetic retinopathy (DR) with specialist-level accuracy. This study aims to validate one such algorithm on a large-scale clinical population and to compare its performance with that of human graders. A total of 25,326 gradable retinal images of patients with diabetes from the community-based, nationwide DR screening program in Thailand were analyzed for DR severity and referable diabetic macular edema (DME). Grades adjudicated by a panel of international retinal specialists served as the reference standard. Relative to human graders, for detecting referable DR (moderate non-proliferative DR [NPDR] or worse), the deep learning algorithm had significantly higher sensitivity (0.97 vs. 0.74, p < 0.001) and slightly lower specificity (0.96 vs. 0.98, p < 0.001). Higher sensitivity of the algorithm was also observed for each of the categories of severe or worse NPDR, proliferative DR (PDR), and DME (p < 0.001 for all comparisons). The quadratic-weighted kappa for determination of DR severity levels was 0.85 for the algorithm and 0.78 for human graders (p < 0.001 for the difference). Across different severity levels of DR for determining referable disease, deep learning significantly reduced the false negative rate (by 23%) at the cost of slightly higher false positive rates (2%). Deep learning algorithms may serve as a valuable tool for DR screening.
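For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how sensitivity, specificity, and quadratic-weighted kappa can be computed from paired grades. It is not the study's analysis code. The integer encoding of severity (0 = no DR through 4 = PDR), the function names, and the ten toy grades are all assumptions made for illustration; none of the numbers come from the screening program.

```python
# Illustrative sketch only: toy grades, not data from the study.
# Assumed encoding of DR severity: 0 = none, 1 = mild NPDR, 2 = moderate NPDR,
# 3 = severe NPDR, 4 = PDR. "Referable" DR is moderate NPDR or worse (grade >= 2).

REFERABLE_THRESHOLD = 2
NUM_GRADES = 5


def sensitivity_specificity(reference, predicted, threshold=REFERABLE_THRESHOLD):
    """Binarize grades at `threshold` and return (sensitivity, specificity)."""
    tp = fn = tn = fp = 0
    for ref, pred in zip(reference, predicted):
        ref_positive = ref >= threshold
        pred_positive = pred >= threshold
        if ref_positive and pred_positive:
            tp += 1
        elif ref_positive:
            fn += 1  # referable case missed by the grader
        elif pred_positive:
            fp += 1  # non-referable case flagged as referable
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def quadratic_weighted_kappa(reference, predicted, num_grades=NUM_GRADES):
    """Agreement on the ordinal scale, penalizing errors by squared distance."""
    n = len(reference)
    observed = [[0] * num_grades for _ in range(num_grades)]
    for ref, pred in zip(reference, predicted):
        observed[ref][pred] += 1

    ref_totals = [sum(row) for row in observed]
    pred_totals = [sum(observed[i][j] for i in range(num_grades))
                   for j in range(num_grades)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_grades):
        for j in range(num_grades):
            weight = (i - j) ** 2 / (num_grades - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Count expected by chance if the two gradings were independent.
            weighted_expected += weight * ref_totals[i] * pred_totals[j] / n
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical reference-standard grades and grader outputs for ten images.
toy_reference = [0, 0, 0, 0, 1, 2, 2, 3, 4, 4]
toy_predicted = [0, 0, 0, 2, 1, 2, 1, 3, 4, 3]

toy_sensitivity, toy_specificity = sensitivity_specificity(toy_reference, toy_predicted)
toy_kappa = quadratic_weighted_kappa(toy_reference, toy_predicted)
print(toy_sensitivity, toy_specificity, toy_kappa)
```

In this sketch the false negative rate is 1 minus sensitivity and the false positive rate is 1 minus specificity, which is the sense in which a more sensitive grader trades fewer missed referable cases for more false referrals.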
