MSTNet: Multi-scale spatial-aware transformer with multi-instance learning for diabetic retinopathy classification.
Journal:
Medical Image Analysis
PMID:
40020421
Abstract
Diabetic retinopathy (DR) is the leading cause of vision loss among diabetic adults worldwide, which underscores the importance of early detection and timely treatment based on fundus images. However, existing deep learning methods struggle to capture the correlation and contextual information of subtle lesion features with datasets of the current scale. To this end, we propose a novel Multi-scale Spatial-aware Transformer Network (MSTNet) for DR classification. MSTNet encodes information from image patches at varying scales as input features and constructs a dual-pathway backbone composed of two Transformer encoders of different sizes to extract both local details and global context from images. To fully exploit structural prior knowledge, we introduce a Spatial-aware Module (SAM) that captures spatial local information within the images. Furthermore, because regions of interest in medical images, unlike those in natural images, often lack distinct subjectivity and continuity, we employ a Multiple Instance Learning (MIL) strategy to aggregate features from diverse regions, thereby strengthening the correlation with subtle lesion areas. Finally, a cross-fusion classifier integrates the dual-pathway features to produce the final classification result. We evaluate MSTNet on four public DR datasets: APTOS2019, RFMiD2020, Messidor, and IDRiD. Extensive experiments demonstrate that MSTNet achieves superior diagnostic and grading accuracy, with improvements of up to 2.0% in accuracy (ACC) and 1.2% in F1 score, highlighting its effectiveness in accurately assessing fundus images.
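
To make the described pipeline concrete, below is a minimal PyTorch sketch of the general idea the abstract outlines: two patch scales feeding two Transformer encoders, attention-gated MIL pooling over patch tokens, and a simple fused classifier head. The patch sizes (16/32), embedding width, gated-attention aggregator, and concatenation-based fusion are illustrative assumptions, and the Spatial-aware Module is omitted; this is not the authors' released architecture.

# Hypothetical sketch of a dual-scale Transformer with MIL pooling for DR grading.
# All sizes and the aggregation/fusion choices are assumptions for illustration.
import torch
import torch.nn as nn


class PatchEncoder(nn.Module):
    """Embed patches of one scale and encode them with a small Transformer."""
    def __init__(self, patch: int, dim: int, depth: int, heads: int, img: int = 224):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.pos = nn.Parameter(torch.zeros(1, (img // patch) ** 2, dim))

    def forward(self, x):                                   # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, N, dim)
        return self.encoder(tokens + self.pos)              # (B, N, dim)


class AttentionMILPool(nn.Module):
    """Attention-based MIL aggregation: weight patch tokens, then sum."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))

    def forward(self, tokens):                               # tokens: (B, N, dim)
        w = torch.softmax(self.score(tokens), dim=1)         # (B, N, 1)
        return (w * tokens).sum(dim=1)                       # (B, dim)


class DualPathDRClassifier(nn.Module):
    """Fine and coarse patch pathways -> MIL pooling -> fused classifier."""
    def __init__(self, num_classes: int = 5, dim: int = 256):
        super().__init__()
        self.fine = PatchEncoder(patch=16, dim=dim, depth=4, heads=8)    # local detail
        self.coarse = PatchEncoder(patch=32, dim=dim, depth=4, heads=8)  # global context
        self.pool_fine = AttentionMILPool(dim)
        self.pool_coarse = AttentionMILPool(dim)
        self.classifier = nn.Linear(dim * 2, num_classes)    # concat fusion for brevity

    def forward(self, x):
        f = self.pool_fine(self.fine(x))
        c = self.pool_coarse(self.coarse(x))
        return self.classifier(torch.cat([f, c], dim=-1))    # (B, num_classes)


if __name__ == "__main__":
    model = DualPathDRClassifier()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)                                      # torch.Size([2, 5])

The gated-attention pooling here is the standard MIL aggregator of Ilse et al.; the paper's cross-fusion classifier would replace the plain concatenation head.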