Google Research

Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program

  • Dr. Paisan Ruamviboonsuk
  • Jonathan Krause
  • Dr. Peranut Chotcomwongse
  • Rory Abbott Sayres
  • Rajiv Raman
  • Kasumi Widner
  • Bilson Campana
  • Sonia Phene
  • Kornwipa Hemarat
  • Mongkol Tadarati
  • Sukhum Silpa-Archa
  • Jirawut Limwattanayingyong
  • Chetan Rao
  • Oscar Kuruvilla
  • Jesse Jung
  • Jeffrey Tan
  • Surapong Orprayoon
  • Chawawat Kangwanwongpaisan
  • Ramase Sukumalpaiboon
  • Chainarong Luengchaichawang
  • Jitumporn Fuangkaew
  • Pipat Kongsap
  • Lamyong Chualinpha
  • Sarawuth Saree
  • Srirut Kawinpanitan
  • Korntip Mitvongsa
  • Siriporn Lawanasakol
  • Chaiyasit Thepchatri
  • Lalita Wongpichedchai
  • Greg Corrado
  • Lily Peng
  • Dale Webster
Nature Partner Journal (npj) Digital Medicine (2019)


Deep learning algorithms have been used to detect diabetic retinopathy (DR) with specialist-level accuracy. This study aims to validate one such algorithm on a large-scale clinical population and to compare its performance with that of human graders. A total of 25,326 gradable retinal images of patients with diabetes from the community-based, nationwide DR screening program in Thailand were analyzed for DR severity and referable diabetic macular edema (DME). Grades adjudicated by a panel of international retinal specialists served as the reference standard. Relative to human graders, for detecting referable DR (moderate non-proliferative DR [NPDR] or worse), the deep learning algorithm had significantly higher sensitivity (0.97 vs. 0.74, p < 0.001) and slightly lower specificity (0.96 vs. 0.98, p < 0.001). Higher sensitivity of the algorithm was also observed for each of the categories of severe or worse NPDR, proliferative DR (PDR), and DME (p < 0.001 for all comparisons). The quadratic-weighted kappa for determination of DR severity levels was 0.85 for the algorithm and 0.78 for human graders (p < 0.001 for the difference). Across different severity levels of DR for determining referable disease, deep learning significantly reduced the false negative rate (by 23 percentage points) at the cost of a slightly higher false positive rate (2 percentage points). Deep learning algorithms may serve as a valuable tool for DR screening.
