Fusion of Foundation and Vision Transformer Model Features for Dermatoscopic Image Classification
Journal: arXiv
Published Date: May 22, 2025
Abstract
Accurate classification of skin lesions from dermatoscopic images is
essential for diagnosis and treatment of skin cancer. In this study, we
investigate the utility of a dermatology-specific foundation model, PanDerm, in
comparison with two Vision Transformer (ViT) architectures (ViT base and Swin
Transformer V2 base) for the task of skin lesion classification. Using frozen
features extracted from PanDerm, we apply non-linear probing with three
different classifiers, namely, multi-layer perceptron (MLP), XGBoost, and
TabNet. For the ViT-based models, we perform full fine-tuning to optimize
classification performance. Our experiments on the HAM10000 and MSKCC datasets
demonstrate that the PanDerm-based MLP model performs comparably to the
fine-tuned Swin Transformer model, while fusion of PanDerm and Swin Transformer
predictions leads to further performance improvements. Future work will explore
additional foundation models, fine-tuning strategies, and advanced fusion
techniques.
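
To make the idea of fusing predictions concrete, the following is a minimal sketch of one common late-fusion scheme: a weighted average of the class probabilities produced by two models. The abstract does not specify the fusion rule used in the study, so the averaging rule, the function names (fuse_predictions, predict_labels), the weight_a parameter, and the probability values below are all illustrative assumptions rather than the authors' method or data.

```python
from typing import List, Sequence


def fuse_predictions(
    probs_a: Sequence[Sequence[float]],
    probs_b: Sequence[Sequence[float]],
    weight_a: float = 0.5,
) -> List[List[float]]:
    """Weighted average of two models' per-sample class probabilities.

    Assumption: simple probability averaging; the paper's actual fusion
    rule may differ.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same number of samples")

    fused = []
    for pa, pb in zip(probs_a, probs_b):
        if len(pa) != len(pb):
            raise ValueError("both models must score the same classes")
        fused.append([weight_a * a + (1.0 - weight_a) * b for a, b in zip(pa, pb)])
    return fused


def predict_labels(probs: Sequence[Sequence[float]]) -> List[int]:
    """Return the index of the highest-probability class for each sample."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


# Hypothetical softmax outputs for three images and three lesion classes.
panderm_mlp_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.4, 0.35, 0.25]]
swin_probs = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6], [0.2, 0.6, 0.2]]

fused = fuse_predictions(panderm_mlp_probs, swin_probs)
labels = predict_labels(fused)
```

In this toy example the two models disagree on the second and third images, and the fused prediction follows whichever model is more confident. In practice, weight_a would typically be tuned on a validation split rather than fixed at 0.5.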