Enhancing Visual Inspection Capability of Multi-Modal Large Language Models on Medical Time Series with Supportive Conformalized and Interpretable Small Specialized Models
Journal:
arXiv
Published Date:
Jan 27, 2025
Abstract
Large language models (LLMs) exhibit remarkable capabilities in visual
inspection of medical time-series data, achieving proficiency comparable to
human clinicians. However, their broad scope limits domain-specific precision,
and proprietary weights hinder fine-tuning for specialized datasets. In
contrast, small specialized models (SSMs) excel in targeted tasks but lack the
contextual reasoning required for complex clinical decision-making. To address
these challenges, we propose ConMIL (Conformalized Multiple Instance Learning),
a decision-support SSM that integrates seamlessly with LLMs. By using Multiple
Instance Learning (MIL) to identify clinically significant signal segments and
conformal prediction for calibrated set-valued outputs, ConMIL enhances LLMs'
interpretative capabilities for medical time-series analysis. Experimental
results demonstrate that ConMIL significantly improves the performance of
state-of-the-art LLMs, such as ChatGPT4.0 and Qwen2-VL-7B. Specifically,
\ConMIL{}-supported Qwen2-VL-7B achieves 94.92% and 96.82% precision for
confident samples in arrhythmia detection and sleep staging, compared to
standalone LLM accuracy of 46.13% and 13.16%. These findings highlight the
potential of ConMIL to bridge task-specific precision and broader contextual
reasoning, enabling more reliable and interpretable AI-driven clinical decision
support.