Hybrid deep learning framework based on EfficientViT for classification of gastrointestinal diseases.
Journal:
Scientific reports
Published Date:
Jul 24, 2025
Abstract
GI diseases are one of the leading causes of morbidity and mortality worldwide, and early and accurate diagnosis is considered to be very important. Traditional methods like endoscopy take time and depend majorly on the judgment of the physician. The proposed Efficient Vision Transformer (EfficientViT) is a new deep learning-based model using EfficientNetB0 in combination with the Vision Transformer (ViT) for the classification of eight different types of diseases in the GI system. EfficientViT utilizes the features of EfficientNetB0 to capture local textures and multi-scale features to achieve structural changes in the GI tract. At the same time, it includes the capacity of the ViT model to recognize the context of images of the GI tract for the detection of slight disease patterns and precursors of disease diffusion. Furthermore, we designed a dual-block in which input is divided into two parts (q1, q2) to better optimize the model q1 processed through an EfficientNet for local details and a q2 through encoder block for capturing the global dependencies, which enables EfficientViT to pay attention to multiple image regions simultaneously. We have tested the model using fivefold cross-validation and achieved an outstanding accuracy of 99.82% compared to the MobileNetV2-based model which reached 99.60%. In addition, EfficientViT demonstrated excellent precision, recall, and F1 scores. Our model, in general, outperforms existing methods, offering a promising tool for clinicians to more reliably and accurately diagnose GI diseases from endoscopic images.