Performance of Vision-Enabled Large Language Models in Image-Based Electrocardiogram Interpretation: Exploratory Evaluation.
Journal:
Journal of Medical Internet Research
Published Date:
Jun 3, 2026
Abstract
BACKGROUND: Vision-enabled large language models (VE-LLMs) have the potential to provide flexible and explainable medical image interpretation. However, their real-world performance on clinical data, such as 12-lead electrocardiograms (ECGs), has not been systematically assessed.
OBJECTIVE: This study aimed to evaluate the diagnostic accuracy and reliability of state-of-the-art generalist VE-LLMs in interpreting real-world ECG images.
METHODS: We tested 6 generalist VE-LLMs (ChatGPT-5, ChatGPT-4, Gemini 2.5, Copilot, Claude Sonnet-4, and Claude Opus-4.1) using 70 deidentified ECG images. A standardized prompt requested 9 determinations: rhythm, first-degree atrioventricular (AV) block, intraventricular conduction block and pattern, corrected QT (QTc) prolongation, premature atrial and ventricular contractions, ischemic ST-segment deviation, and axis deviation. An expert consensus served as the reference standard. In addition, 2 image-based ECG-specialized LLMs (PULSE-7B and ECG-Instruct-Llama-3.2-11B-Vision) were tested for exploratory comparison. Model outputs were evaluated using overall and per-category diagnostic metrics.
RESULTS: Overall balanced accuracy across generalist models ranged from 50.1% to 61.8% (Cochran Q, P<.001). ChatGPT-5 achieved the highest balanced accuracy (61.8%) but had the slowest response time (median 276 s, IQR 110-407 s), whereas Copilot responded within a median of 3 s (IQR 2-4 s). Balanced accuracy for rhythm classification ranged from 38.6% to 55.8%, but sensitivity for atrial fibrillation among generalist models was ≤11.1%, with each model detecting either none or only 1 of the 9 cases. Detection of first-degree AV block (sensitivity 0%-22%; 0/9 to 2/9) and QTc prolongation (sensitivity 0%-45.5%; 0/22 to 10/22) was poor. Intraventricular block was identified with up to 67.8% balanced accuracy, but correct subtype assignment was ≤44% (≤11/25). ST-segment deviation sensitivity was <25% for all generalist models (highest 3/14). Agreement with expert interpretation was low, with Cohen κ indicating poor-to-fair concordance (κ≤0.39). Specialized models achieved overall balanced accuracy of 56.5% (ECG-Instruct-Llama-3.2-11B-Vision) and 64.4% (PULSE-7B), with PULSE-7B showing higher task-specific balanced accuracy in rhythm classification and ectopic beat detection (up to 86.3% and 89.2%, respectively).
CONCLUSIONS: VE-LLMs showed moderate overall performance but mostly low sensitivity and limited agreement with expert ECG interpretation. Current performance remains inconsistent across models and diagnostic categories and is insufficient to support clinical deployment.
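The metrics reported above (sensitivity, balanced accuracy, and Cohen κ) can be illustrated with a minimal Python sketch. This is not the authors' analysis code. The function names are illustrative, and the 2x2 table near the end is a hypothetical example chosen only to be consistent with 70 ECGs and 9 positive cases; it is not data from the study. The sensitivity checks use the case counts quoted in the abstract.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: detected cases / all reference-positive cases."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: correct negatives / all reference-negative cases."""
    return tn / (tn + fp)


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity and specificity for a binary determination."""
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2


def cohen_kappa(tp: int, fn: int, tn: int, fp: int) -> float:
    """Cohen kappa for a 2x2 table (model vs reference standard)."""
    n = tp + fn + tn + fp
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of both raters.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (observed - expected) / (1 - expected)


# Sensitivities recomputed from counts quoted in the abstract.
af_best = sensitivity(tp=1, fn=8)      # 1 of 9 atrial fibrillation cases
avb_best = sensitivity(tp=2, fn=7)     # 2 of 9 first-degree AV block cases
qtc_best = sensitivity(tp=10, fn=12)   # 10 of 22 QTc prolongation cases
st_best = sensitivity(tp=3, fn=11)     # 3 of 14 ST-segment deviation cases

# HYPOTHETICAL 2x2 table (assumption, not study data): 9 positives, 61 negatives.
hypo = dict(tp=1, fn=8, tn=55, fp=6)
hypo_bal_acc = balanced_accuracy(**hypo)
hypo_kappa = cohen_kappa(**hypo)

if __name__ == "__main__":
    print(f"AF sensitivity: {af_best:.1%}")
    print(f"Hypothetical balanced accuracy: {hypo_bal_acc:.1%}")
    print(f"Hypothetical Cohen kappa: {hypo_kappa:.3f}")
```

In the hypothetical table, raw accuracy is 80% (56/70), yet balanced accuracy is close to 50% and κ is close to 0, because nearly all correct answers are negatives. This is one way a model can appear moderately accurate overall while showing the low sensitivity and limited agreement described in the results.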