Zero-Shot Multi-modal Large Language Model vs. Supervised Deep Learning: A Comparative Study on CT-Based Intracranial Hemorrhage Subtyping
Journal:
arXiv
Published Date:
May 14, 2025
Abstract
Introduction: Timely identification of intracranial hemorrhage (ICH) subtypes
on non-contrast computed tomography (NCCT) is critical for prognosis prediction and
therapeutic decision-making, yet remains challenging due to low contrast and
blurred boundaries. This study evaluates the performance of zero-shot
multi-modal large language models (MLLMs) compared to traditional deep learning
methods in ICH binary classification and subtyping. Methods: We utilized a
dataset provided by RSNA, comprising 192 NCCT volumes. The study compares
various MLLMs, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet V2,
with conventional deep learning models, including ResNet50 and Vision
Transformer. Carefully crafted prompts were used to guide MLLMs in tasks such
as ICH presence, subtype classification, localization, and volume estimation.
Results: The results indicate that in the ICH binary classification task,
traditional deep learning models outperform MLLMs comprehensively. For subtype
classification, MLLMs also exhibit inferior performance compared to traditional
deep learning models, with Gemini 2.0 Flash achieving a macro-averaged
precision of 0.41 and a macro-averaged F1 score of 0.31. Conclusion: While
MLLMs excel in interactive capabilities, their overall accuracy in ICH
subtyping is inferior to deep networks. However, MLLMs enhance interpretability
through language interactions, indicating potential in medical imaging
analysis. Future efforts will focus on model refinement and developing more
precise MLLMs to improve performance in three-dimensional medical image
processing.
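
For readers unfamiliar with the metrics quoted in the Results, the following is a minimal sketch of how macro-averaged precision and macro-averaged F1 are commonly computed for a single-label, multi-class task: each class gets its own precision and F1, and the per-class values are averaged with equal weight. It is not the authors' evaluation code. The function name, the three subtype labels, and the toy predictions are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def macro_precision_f1(y_true, y_pred, labels):
    """Return (macro precision, macro F1) for single-label predictions.

    Hypothetical helper for illustration; not taken from the paper.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, f1_scores = [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        # Convention assumed here: a class with no predictions scores 0.
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        f1_scores.append(f1)

    # Macro averaging: every class counts equally, whatever its frequency.
    n = len(labels)
    return sum(precisions) / n, sum(f1_scores) / n


# Toy data only; these labels and predictions are made up.
labels = ["EDH", "SDH", "SAH"]
y_true = ["EDH", "EDH", "SDH", "SDH", "SAH", "SAH"]
y_pred = ["EDH", "SDH", "SDH", "SDH", "SAH", "EDH"]
macro_p, macro_f1 = macro_precision_f1(y_true, y_pred, labels)
print(f"macro precision = {macro_p:.2f}, macro F1 = {macro_f1:.2f}")
```

Because every class carries equal weight, a model that performs poorly on a rare subtype is penalized as heavily as one that performs poorly on a common subtype, which is why macro averages can be much lower than overall accuracy on imbalanced data.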