Development and validation of a lesion-supervised deep learning system for diabetic retinopathy grading according to UK national screening criteria
Journal: medRxiv
Published Date: Apr 28, 2026
Abstract
Background: Diabetic retinopathy (DR) is the leading cause of preventable blindness among working-age adults worldwide, yet screening coverage remains inadequate, particularly in low- and middle-income countries (LMICs). Automated deep learning systems could help address the global shortage of expert graders, but most existing models lack lesion-level interpretability and are not aligned with established clinical referral frameworks. We developed and validated DRAGS (Diabetic Retinopathy Automated Grading System), a hybrid deep learning model that grades DR according to the UK Diabetic Eye Screening Programme (DESP) classification and provides lesion-level explainability.
Methods: We trained and validated a DenseNet-201-based convolutional neural network on 20,281 anonymised fundus images from two tertiary eye care institutions in Bangladesh. Images were graded by fellowship-trained retinal specialists using the UK DESP framework, resulting in ten clinically interpretable classes that combine retinopathy grade (R0-R3) and maculopathy status (M0/M1). A companion dataset of 2,936 pixel-level lesion masks spanning nine pathological categories was used to train a parallel multi-label lesion-detection head. The dataset was partitioned 70:15:15 into training, validation, and test sets, stratified at the patient level. Performance was evaluated using macro-averaged AUROC (DeLong estimator), sensitivity, specificity, F1 score, quadratically weighted Cohen's kappa, and expected calibration error (ECE), with 95% CIs from 2,000 bootstrap resamples. Spatial alignment of Grad-CAM activations with ground-truth lesion masks was assessed using the Dice coefficient and intersection over union (IoU). This study follows the TRIPOD+AI reporting guidelines.
Findings: On the held-out test set (Component I: n = 3,044; Component II: n ≈ 440), DRAGS achieved class-wise precision, recall, and F1 scores ranging from 0.88 to 0.99 across all ten UK DESP grades, with the advanced proliferative stages (R3-M0, R3-M1) consistently exceeding 0.95. Overall accuracy was approximately 91.1% and quadratically weighted Cohen's kappa was approximately 0.90. For referable versus non-referable DR, sensitivity was 90.7% and specificity was 91.9%. The companion lesion-detection head achieved macro-averaged sensitivity of 93.9%, specificity of 99.5%, and AUROC of 0.997 across nine lesion classes; seven of the nine classes achieved an AUROC of 1.00. Grad-CAM activations shifted progressively from diffuse patterns in normal eyes to lesion-dense peripheral patterns in proliferative DR, with the highest agreement with ground-truth masks for microaneurysms and exudates. Mean inference time was 110-160 ms per image.
Interpretation: DRAGS demonstrates high diagnostic accuracy for ten-class UK DESP-aligned DR grading, with clinically interpretable lesion-level explainability, on a large real-world LMIC dataset. External validation and prospective clinical evaluation are warranted before deployment.
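For readers unfamiliar with two of the agreement metrics named above, the following is a minimal, illustrative Python sketch of quadratically weighted Cohen's kappa and of Dice/IoU overlap between a heatmap and a lesion mask. It is not the authors' implementation. The function names, the encoding of grades as ordered integers, and the 0.5 binarisation threshold for the heatmap are all assumptions made for illustration; the abstract does not state how the ten classes were ordered or how Grad-CAM maps were thresholded.

```python
from collections import Counter


def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic weights for ordinal labels 0..n_classes-1.

    Assumption: grades are encoded as integers whose order reflects severity.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred) or n_classes < 2:
        raise ValueError("need equal-length, non-empty inputs and n_classes >= 2")
    observed = Counter(zip(y_true, y_pred))  # joint counts of (true, predicted)
    true_counts = Counter(y_true)            # marginal counts for the reference
    pred_counts = Counter(y_pred)            # marginal counts for the model
    scale = (n_classes - 1) ** 2
    weighted_obs = 0.0  # weighted observed disagreement
    weighted_exp = 0.0  # weighted disagreement expected by chance
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / scale
            weighted_obs += w * observed[(i, j)] / n
            weighted_exp += w * true_counts[i] * pred_counts[j] / (n * n)
    if weighted_exp == 0:
        # Both raters used one identical class throughout; kappa is formally
        # undefined here, and this sketch reports it as perfect agreement.
        return 1.0
    return 1.0 - weighted_obs / weighted_exp


def dice_and_iou(heatmap, mask, threshold=0.5):
    """Dice and IoU between a thresholded heatmap and a binary mask.

    Both inputs are 2-D lists of equal shape. The threshold is an assumed
    value, not one taken from the paper.
    """
    pred = [v >= threshold for row in heatmap for v in row]
    truth = [bool(v) for row in mask for v in row]
    if len(pred) != len(truth):
        raise ValueError("heatmap and mask must have the same number of pixels")
    inter = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    union = total - inter
    if union == 0:
        return 1.0, 1.0  # both empty: treated as full agreement in this sketch
    return 2 * inter / total, inter / union


if __name__ == "__main__":
    print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], n_classes=4))
    print(dice_and_iou([[0.9, 0.2], [0.7, 0.1]], [[1, 0], [0, 0]]))
```

Quadratic weighting penalises a disagreement by the square of the distance between grades, so confusing adjacent grades costs far less than confusing distant ones. Dice and IoU both measure overlap, and Dice is always at least as large as IoU for the same pair of masks.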