Pre-trained Vision Transformer With Masked Autoencoder for Automated Diabetic Macular Edema Detection from Optical Coherence Tomography Images
Journal: medRxiv
Published Date: Jan 1, 2025
Abstract
This study developed and evaluated a self-supervised learning approach that uses a Vision Transformer (ViT) pre-trained as a Masked Autoencoder (MAE) for automated detection of diabetic macular edema (DME) from optical coherence tomography (OCT) images, addressing the need for scalable screening in diabetic eye care. The study design was artificial intelligence model training.

We used the publicly available Kermany dataset of 109,312 OCT images and framed DME detection as a binary classification task (11,559 DME vs. 97,753 non-DME images). Five deep learning architectures were compared: MAE-pretrained ViT (MAE_ViT), standard ViT, ResNet18, VGG19_bn, and EfficientNetV2. MAE_ViT was trained in two stages: (1) self-supervised pre-training with 75% patch masking for 1,000 epochs to learn robust visual representations, and (2) supervised fine-tuning for DME classification. Model performance was evaluated using accuracy, sensitivity, specificity, F1 score, and area under the receiver operating characteristic curve (AU-ROC), with 95% confidence intervals calculated via bootstrap resampling.

MAE_ViT achieved the best performance, with AU-ROC 0.999 (95% CI: 0.999-1.000), accuracy 98.5% (95% CI: 97.7-99.2%), sensitivity 99.6% (95% CI: 98.7-100%), and specificity 98.1% (95% CI: 97.2-99.1%). VGG19_bn was second best (AU-ROC 0.997), while ResNet18 showed poor specificity (28.3%) despite perfect sensitivity. MAE_ViT also outperformed the standard supervised ViT (AU-ROC 0.995), which suggests that self-supervised pre-training on unlabeled images improves the learned representations.

The MAE pre-trained Vision Transformer sets a new benchmark for automated DME detection, combining high diagnostic accuracy with reduced annotation requirements that could support deployment in resource-constrained settings.
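
The abstract reports 95% confidence intervals obtained by bootstrap resampling but does not say which bootstrap variant was used. As an illustration only, the sketch below assumes a simple percentile bootstrap over test images and applies it to sensitivity and specificity. The function names, the number of resamples (`n_boot`), and the toy labels are hypothetical and are not taken from the study.

```python
import random


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 = DME."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def bootstrap_ci(y_true, y_pred, metric_index, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (an assumed variant, not confirmed by the source).

    metric_index selects 0 = sensitivity or 1 = specificity.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample test cases with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        value = sensitivity_specificity(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )[metric_index]
        if value == value:  # skip NaN from resamples that lack one class
            stats.append(value)
    stats.sort()
    # Approximate percentiles by rank within the sorted resampled statistics.
    lower = stats[int((alpha / 2) * (len(stats) - 1))]
    upper = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lower, upper


# Synthetic toy data (not from the Kermany dataset): 20 DME and 80 non-DME cases.
y_true = [1] * 20 + [0] * 80
y_pred = [1] * 19 + [0] * 1 + [0] * 76 + [1] * 4

sens, spec = sensitivity_specificity(y_true, y_pred)
sens_ci = bootstrap_ci(y_true, y_pred, metric_index=0)
spec_ci = bootstrap_ci(y_true, y_pred, metric_index=1)
print(f"sensitivity {sens:.3f}, 95% CI {sens_ci[0]:.3f}-{sens_ci[1]:.3f}")
print(f"specificity {spec:.3f}, 95% CI {spec_ci[0]:.3f}-{spec_ci[1]:.3f}")
```

The same resampling loop could wrap any other metric, such as accuracy or AU-ROC, by swapping the function that computes the statistic on each resample.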