Vision Language Models versus Machine Learning Models Performance on Polyp Detection and Classification in Colonoscopy Images
Journal:
arXiv
Published Date:
Mar 27, 2025
Abstract
Introduction: This study provides a comprehensive performance assessment of
vision-language models (VLMs) against established convolutional neural networks
(CNNs) and classic machine learning models (CMLs) for computer-aided detection
(CADe) and computer-aided diagnosis (CADx) of colonoscopy polyp images. Method:
We analyzed 2,258 colonoscopy images with corresponding pathology reports from
428 patients. We preprocessed all images using standardized techniques
(resizing, normalization, and augmentation) and implemented a rigorous
comparative framework evaluating 11 distinct models: ResNet50, four CMLs (random
forest, support vector machine, logistic regression, decision tree), two
specialized contrastive vision-language encoders (CLIP, BiomedCLIP), and three
general-purpose VLMs (GPT-4, Gemini-1.5-Pro, Claude-3-Opus). Our performance
assessment focused on two clinical tasks: polyp detection (CADe) and
classification (CADx). Result: In polyp detection, ResNet50 achieved the best
performance (F1: 91.35%, AUROC: 0.98), followed by BiomedCLIP (F1: 88.68%,
AUROC: [AS1]). GPT-4 demonstrated comparable effectiveness to traditional
machine learning approaches (F1: 81.02%, AUROC: [AS2]), outperforming other
general-purpose VLMs. For polyp classification, performance rankings remained
consistent but with lower overall metrics. ResNet50 maintained the highest
efficacy (weighted F1: 74.94%), while GPT-4 demonstrated moderate capability
(weighted F1: 41.18%), significantly exceeding other VLMs (Claude-3-Opus
weighted F1: 25.54%, Gemini-1.5-Pro weighted F1: 6.17%). Conclusion: CNNs
remain superior for both CADx and CADe tasks. However, VLMs like BiomedCLIP and
GPT-4 may be useful for polyp detection tasks where training CNNs is not
feasible.
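
For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how F1, weighted F1, and AUROC are commonly computed. It is not the authors' evaluation code. The function names, the class labels ("adenoma", "hyperplastic", "serrated"), and the toy predictions are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def f1_per_class(y_true, y_pred, label):
    """F1 for one class: 2*TP / (2*TP + FP + FN); 0.0 if the class never occurs."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of true labels."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(f1_per_class(y_true, y_pred, c) * n / total
               for c, n in support.items())


def auroc(y_true, scores):
    """AUROC as the probability that a random positive outranks a random negative.

    Ties count as half a win. Labels are assumed to be 1 (polyp) or 0 (no polyp).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical three-class classification (CADx) example.
cadx_true = ["adenoma", "adenoma", "adenoma",
             "hyperplastic", "hyperplastic", "serrated"]
cadx_pred = ["adenoma", "adenoma", "hyperplastic",
             "hyperplastic", "hyperplastic", "adenoma"]

# Hypothetical binary detection (CADe) example with model confidence scores.
cade_true = [1, 1, 0, 0]
cade_scores = [0.9, 0.4, 0.5, 0.1]

if __name__ == "__main__":
    print("weighted F1:", round(weighted_f1(cadx_true, cadx_pred), 4))
    print("AUROC:", auroc(cade_true, cade_scores))
```

Weighted F1 is the natural summary for the classification task because polyp classes are rarely balanced: a rare class that the model misses entirely pulls the score down only in proportion to how often that class occurs.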