Vision Transformers Based AI Models For Predicting Colorectal Cancer from Digital Pathology WSI: Use Case Of MHIST dataset

Journal: medRxiv
Published Date:

Abstract

This study investigates the efficacy of transformer-based deep learning architectures, specifically Vision Transformer (ViT), Class Attention in Image Transformers (CaiT), and Data-Efficient Image Transformers (DeiT), for the binary classification of colorectal polyps using the Minimalist Histopathology Image Analysis Dataset (MHIST). The dataset comprises 3,152 hematoxylin and eosin (H&E)-stained, formalin-fixed paraffin-embedded (FFPE) images annotated as either hyperplastic polyps (HP) or sessile serrated adenomas (SSA). Models were evaluated with 5-fold stratified cross-validation, and performance was quantified using accuracy, precision, recall, F1-score, and AUC-ROC. Experimental results showed that transformer architectures, particularly CaiT (accuracy of 90.18%, AUC-ROC of 95.52%), outperformed traditional convolutional neural networks (CNNs). The superior performance of CaiT is attributed to its specialized class-attention mechanism, which captures the nuanced morphological differences essential for accurate histopathological classification. These findings underscore the potential of transformer-based models to enhance diagnostic precision, reduce variability in pathological assessment, and facilitate earlier and more reliable colorectal cancer screening.
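
The evaluation protocol named in the abstract (stratified 5-fold splitting plus accuracy, precision, recall, F1-score, and AUC-ROC) can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' code: the function names `stratified_folds`, `binary_metrics`, and `auc_roc` are hypothetical, and treating label 1 as the positive class (for example SSA) is an assumption, since the abstract does not say which class was treated as positive.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with label 1 as the positive class (assumed)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc_roc(y_true, scores):
    """AUC as the chance a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels for illustration only; these are not MHIST class counts.
toy_labels = [1] * 30 + [0] * 70
toy_folds = stratified_folds(toy_labels, k=5)
```

In a cross-validation loop, each fold in turn would serve as the held-out set while a model is trained on the remaining four, and the per-fold metrics would then be averaged.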

Authors

  • Kondejkar, T.
  • Tunik, G.
  • Amal, S.