Self-supervised learning framework for efficient classification of endoscopic images using pretext tasks.

Journal: PLOS ONE
PMID:

Abstract

Identifying anatomical landmarks in endoscopic video frames is essential for the early diagnosis of gastrointestinal diseases. However, this task remains challenging due to variability in visual characteristics across different regions and the limited availability of annotated data. In this study, we propose a novel self-supervised learning (SSL) framework that integrates three complementary pretext tasks (colorization, jigsaw puzzle solving, and patch prediction) to enhance feature learning from unlabeled endoscopic images. By leveraging these tasks, our model extracts rich, meaningful representations, improving the downstream classification of Z-line, esophageal, and antrum/pylorus regions. To further enhance feature extraction and model interpretability, we incorporate attention mechanisms, transformer-based architectures, and Grad-CAM visualization. The integration of attention layers and transformers strengthens the model's ability to learn discriminative and generalizable features, while Grad-CAM improves explainability by highlighting critical decision-making regions. These enhancements make our approach more suitable for clinical deployment, ensuring both high accuracy and interpretability. We evaluate our proposed framework on a comprehensive dataset, demonstrating substantial improvements in classification accuracy, precision, recall, and F1-score compared to conventional models trained without SSL. Specifically, our combined model achieves a classification accuracy of 98%, with high precision and recall across all classes, as reflected in ROC curves and confusion matrices. These results underscore the effectiveness of pretext-task-based SSL, attention mechanisms, and transformers for anatomical landmark identification in endoscopic video frames.
Our work introduces a scalable and interpretable methodology for improving endoscopic image classification, reducing reliance on large annotated datasets while enhancing model performance in real-world clinical applications. Future research will explore validation on diverse datasets, real-time diagnostic integration, and scalability to further advance medical image analysis using SSL.
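
As an illustration of the three pretext tasks named in the abstract, the sketch below shows one way training samples for each task could be derived from a single unlabeled frame. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 3x3 jigsaw grid, the 32-pixel masked patch, the interpretation of "patch prediction" as reconstructing a hidden patch, and the random array standing in for an endoscopic frame are all hypothetical choices made for the example.

```python
import numpy as np


def colorization_pair(rgb):
    """Return (grayscale input, RGB target) for a colorization pretext task."""
    # ITU-R BT.601 luma weights; rgb is float in [0, 1] with shape (H, W, 3).
    gray = rgb @ np.array([0.299, 0.587, 0.114])
    return gray, rgb


def jigsaw_sample(img, grid, rng):
    """Cut img into grid x grid tiles, shuffle them, return (tiles, permutation)."""
    h, w = img.shape[0] // grid, img.shape[1] // grid
    tiles = [img[r * h:(r + 1) * h, c * w:(c + 1) * w]
             for r in range(grid) for c in range(grid)]
    perm = rng.permutation(grid * grid)
    # shuffled[i] is original tile perm[i]; a model would be trained to recover perm.
    shuffled = np.stack([tiles[i] for i in perm])
    return shuffled, perm


def patch_prediction_sample(img, size, rng):
    """Hide one random square patch; return (masked image, hidden patch, (top, left))."""
    top = int(rng.integers(0, img.shape[0] - size + 1))
    left = int(rng.integers(0, img.shape[1] - size + 1))
    target = img[top:top + size, left:left + size].copy()
    masked = img.copy()
    masked[top:top + size, left:left + size] = 0.0  # zero out the hidden region
    return masked, target, (top, left)


rng = np.random.default_rng(0)
frame = rng.random((96, 96, 3))  # hypothetical stand-in for an unlabeled frame

gray, color_target = colorization_pair(frame)
tiles, perm = jigsaw_sample(frame, grid=3, rng=rng)
masked, patch, loc = patch_prediction_sample(frame, size=32, rng=rng)
```

In each case the supervision signal (the original colors, the tile permutation, or the hidden patch) comes from the image itself, which is why no manual annotation is needed at this stage.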

Authors

  • Shima Ayyoubi Nezhad
    School of Industrial and Systems Engineering, Tarbiat Modares University (TMU), Tehran, Iran.
  • Golnaz Tajeddin
    School of Industrial and Systems Engineering, Tarbiat Modares University (TMU), Tehran, Iran.
  • Toktam Khatibi
    Faculty of Industrial and Systems Engineering, Tarbiat Modares University, Tehran 1411713116, Iran. Electronic address: toktam.khatibi@modares.ac.ir.
  • Masoudreza Sohrabi
    Gastrointestinal and Liver Diseases Research Center, Iran University of Medical Sciences (IUMS), Tehran, Iran.