Comparative Analysis of Deep Learning Strategies for Hypertensive Retinopathy Detection from Fundus Images: From Scratch and Pre-trained Models
Journal: arXiv
Published Date: Jun 14, 2025
Abstract
This paper presents a comparative analysis of deep learning strategies for
detecting hypertensive retinopathy from fundus images, a central task in the
HRDC challenge~\cite{qian2025hrdc}. We investigate three distinct approaches: a
custom CNN, a suite of pre-trained transformer-based models, and an AutoML
solution. Our findings reveal a stark, architecture-dependent response to data
augmentation. Augmentation significantly boosts the performance of pure Vision
Transformers (ViTs); we hypothesize that, because of their weaker inductive
biases, they depend on augmented data to learn robust spatial and structural
features. Conversely, the same augmentation strategy degrades the performance
of hybrid ViT-CNN models, whose stronger, pre-existing biases from the CNN
component may be "confused" by the transformations. We show that a model with a
smaller patch size (ViT-B/8) excels on augmented data, as its finer patches
capture more fine-grained detail.
Furthermore, we demonstrate that a powerful self-supervised model like DINOv2
fails on the original, limited dataset but is "rescued" by augmentation,
highlighting the critical need for data diversity to unlock its potential.
Preliminary tests with a ViT-Large model show poor performance, underscoring
the risk of using excessively high-capacity models on small, specialized datasets.
This work provides critical insights into the interplay between model
architecture, data augmentation, and dataset size for medical image
classification.
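
To make the patch-size observation concrete, the following minimal sketch counts the patches a ViT extracts from a square input at several patch sizes. The 224x224 input resolution and the helper name `num_patches` are illustrative assumptions; neither is taken from this work.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT cuts from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Assumed input resolution (hypothetical; not stated in the abstract).
IMAGE_SIZE = 224

for patch in (8, 16, 32):
    print(f"patch {patch:>2}: {num_patches(IMAGE_SIZE, patch)} patches")
```

Under this assumption, a patch size of 8 yields 784 patches against 196 for a patch size of 16, so each token covers a quarter of the image area, at the cost of a fourfold longer sequence for self-attention.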