From Clinical Judgment to Large Language Models: Benchmarking Predictive Approaches for Unplanned Hospital Admissions

Journal: medRxiv
Published Date:

Abstract

While machine learning (ML) models show strong performance for predicting unplanned hospital visits, their clinical utility relative to physician judgment remains unclear. Large language models (LLMs) offer a promising middle ground, potentially combining algorithmic accuracy with human-interpretable reasoning. We directly compared the predictive performance of physicians, structured ML models, and LLMs for forecasting 30-day emergency department (ED) visits and unplanned hospital admissions under equivalent data conditions.

We selected 404 cases from structured EHR data and converted them into synthetic clinical vignettes using GPT-5. Thirty-five physicians evaluated these vignettes, while CLMBR-T (a machine learning model trained on structured EHR data) was applied to the original data. Eight LLMs evaluated the same vignettes. We compared discriminative performance (AUROC, AUPRC), calibration (Brier score, Expected Calibration Error), and confidence-performance relationships across all methods.

CLMBR-T achieved the highest discriminative performance (AUROC 0.79, 95% CI: 0.75-0.83; AUPRC 0.78, 95% CI: 0.72-0.83), followed by large LLMs (DeepSeek V3, Claude 4.1 Opus, GPT-5; AUROC 0.74). Pooled physician predictions had the lowest performance (AUROC 0.65, 95% CI: 0.59-0.70; AUPRC 0.61, 95% CI: 0.54-0.68). However, LLM predictions aligned more closely with physician predictions (correlation r=0.51-0.65) than CLMBR-T predictions did (r=0.37). CLMBR-T demonstrated superior confidence calibration, with a significant confidence-performance correlation (r=0.21, p<0.001), while physicians showed poor calibration (r=0.07, p=0.17). Individual physician performance varied widely (AUROC 0.55-0.83), with three of the 35 physicians exceeding the ML benchmark.

ML models trained on structured EHR data outperform both physicians and LLMs in predictive accuracy and confidence calibration, though LLMs achieved competitive zero-shot performance and better approximated human clinical reasoning. These findings suggest that hybrid approaches combining high-performance ML screening with interpretable LLM explanations may optimize both accuracy and clinical adoption. The substantial variability in physician performance highlights the limitations of benchmarking against “average” clinical judgment.
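For readers unfamiliar with the metrics named above, the following is a minimal sketch of how AUROC, the Brier score, and Expected Calibration Error can be computed from binary outcomes and predicted probabilities. It is not the authors' implementation, which the abstract does not describe. The function names, the choice of ten equal-width bins for ECE, and the toy data are all assumptions made for illustration; the toy data are invented and are not study data.

```python
from typing import Sequence


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Probability that a random positive case is scored above a random
    negative case, counting ties as one half (pairwise definition)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 10
) -> float:
    """Weighted mean gap between observed event rate and mean predicted
    probability, over equal-width probability bins (an assumed binning)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / total) * abs(observed - predicted)
    return ece


# Invented toy example: 1 = unplanned visit within 30 days, 0 = none.
toy_labels = [1, 0, 1, 0, 1, 0, 0, 1]
toy_probs = [0.9, 0.2, 0.7, 0.4, 0.6, 0.7, 0.1, 0.3]

print("AUROC:", auroc(toy_labels, toy_probs))
print("Brier:", brier_score(toy_labels, toy_probs))
print("ECE:  ", expected_calibration_error(toy_labels, toy_probs))
```

Higher AUROC indicates better discrimination, while lower Brier score and lower ECE indicate better calibration. The pairwise AUROC loop is quadratic in the number of cases, which is acceptable for a cohort of a few hundred vignettes; a rank-based formulation would be preferable for larger datasets.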

Authors

  • Bernardo Neves; Mário J. Silva