Comparative Evaluation of Large Language and Multimodal Models in Detecting Spinal Stabilization Systems on X-Ray Images.

Journal: Journal of clinical medicine

Published Date: May 8, 2025

Abstract

Open-source AI models are increasingly applied in medical imaging, yet their effectiveness in detecting and classifying spinal stabilization systems remains underexplored. This study compares ChatGPT-4o (a large language model) and BiomedCLIP (a multimodal model) in their analysis of posturographic X-ray images (AP projection) to assess their accuracy in identifying the presence, type (growing vs. non-growing), and specific system (MCGR vs. PSF). A dataset of 270 X-ray images (93 without stabilization, 80 with MCGR, and 97 with PSF) was analyzed manually by neurosurgeons and evaluated using a three-stage AI-based questioning approach. Performance was assessed via classification accuracy, Gwet's Agreement Coefficient (AC1) for inter-rater reliability, and a two-tailed z-test for statistical significance ( < 0.05). The results indicate that GPT-4o demonstrates high accuracy in detecting spinal stabilization systems, achieving near-perfect recognition (97-100%) for the presence or absence of stabilization. However, its consistency is reduced when distinguishing complex growing-rod (MCGR) configurations, with agreement scores dropping significantly (AC1 = 0.32-0.50). In contrast, BiomedCLIP displays greater response consistency (AC1 = 1.00) but struggles with detailed classification, particularly in recognizing PSF (11% accuracy) and MCGR (4.16% accuracy). Sensitivity analysis revealed GPT-4o's superior stability in hierarchical classification tasks, while BiomedCLIP excelled in binary detection but showed performance deterioration as the classification complexity increased. These findings highlight GPT-4o's robustness in clinical AI-assisted diagnostics, particularly for detailed differentiation of spinal stabilization systems, whereas BiomedCLIP's precision may require further optimization to enhance its applicability in complex radiographic evaluations.

Authors

Bartosz Polis

Department of Neurosurgery, Polish-Mother's Memorial Hospital Research Institute, 93-338 Lodz, Poland.
Agnieszka Zawadzka-Fabijan

Department of Rehabilitation Medicine, Faculty of Health Sciences, Medical University of Lodz, 90-419 Lodz, Poland.
Robert Fabijan

Independent Researcher, Luton LU2 0GS, UK.
Róża Kosińska

Department of Neurosurgery, Polish-Mother's Memorial Hospital Research Institute, 93-338 Lodz, Poland.
Emilia Nowosławska

Department of Neurosurgery, Polish-Mother's Memorial Hospital Research Institute, 93-338 Lodz, Poland.
Artur Fabijan

Department of Neurosurgery, Polish-Mother's Memorial Hospital Research Institute, 93-338 Lodz, Poland.

Keywords

No keywords available for this article.

External Resources

View on PubMed Access via DOI PubMed (40429276)

Comparative Evaluation of Large Language and Multimodal Models in Detecting Spinal Stabilization Systems on X-Ray Images.

Abstract

Authors

Keywords

External Resources

Popular Topics

Recent Journals

Comparative Evaluation of Large Language and Multimodal Models in Detecting Spinal Stabilization Systems on X-Ray Images.

Abstract

Authors

Keywords

External Resources

Stay Ahead of Medical AI

Popular Topics

Recent Journals