Understanding Uncertainty in Large Language Model Predictions of Early Death in Critically Ill Patients: A Conformal Prediction Approach

Journal: medRxiv
Published Date:

Abstract

Early prediction of in-hospital death remains a significant challenge because little structured data is available at initial admission. Unstructured clinical notes, which often contain important observations and impressions, are an underutilized resource for real-time risk stratification. Recent advances in large language models (LLMs) offer a promising way to use this unstructured information, but the poorly understood patient-level uncertainty of LLM predictions for such critical forecasts is a serious deterrent to their use in clinical settings. This study evaluates the effectiveness of, and confidence in, predictions of in-hospital death probability for an individual patient made by an LLM, specifically GPT-4o, from unstructured clinical notes. We applied conformal prediction (CP) to quantify the uncertainty of GPT-4o’s zero-shot predictions of in-hospital death, using concatenated clinical notes documented during the first 24 hours of intensive care unit (ICU) admission in MIMIC-III for patients with acute kidney failure who were admitted through the emergency department (ED). Of the two classes, “in-hospital death” and “in-hospital survive”, the GPT model performed better on the in-hospital death class, achieving precision 0.52 (95% CI 0.48–0.56), recall 0.93 (95% CI 0.90–0.95), and F1-score 0.66 (95% CI 0.63–0.70). The CP framework provided an overall empirical coverage of 90.4%, exceeding the 90% target. However, class-specific coverage was imbalanced: 99.7% for the death class and 81.1% for the survived class. The model’s outputs were overconfident, particularly when its predictions were incorrect. Integrating conformal prediction is a promising approach to quantifying and calibrating uncertainty in LLM outputs for individual patient predictions, thereby enhancing their potential applicability to clinical decision-making.
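
The coverage figures above can be made concrete with a minimal sketch of split conformal prediction for a two-class problem. The abstract does not state which nonconformity score was used or how class probabilities were obtained from GPT-4o, so the score below (one minus the probability assigned to the true class) and all function names are illustrative assumptions, not the authors' implementation. The sketch calibrates a threshold on held-out cases, builds a prediction set for each new patient, and reports overall and class-specific coverage.

```python
import math
import numpy as np

# Assumption: class 0 = survived, class 1 = in-hospital death, and each row of
# `probs` holds the model's probabilities for [class 0, class 1].

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate the score threshold for a target coverage of 1 - alpha."""
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    n = len(cal_labels)
    # Nonconformity score (assumed): 1 - probability given to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration cases for this alpha: every label is kept.
        return float("inf")
    return float(np.sort(scores)[k - 1])

def prediction_sets(probs, qhat):
    """Boolean (m, 2) array; True where a class is in the prediction set."""
    return (1.0 - np.asarray(probs, dtype=float)) <= qhat

def coverage(sets, labels):
    """Fraction of cases whose true label is inside the prediction set."""
    labels = np.asarray(labels, dtype=int)
    return float(np.mean(sets[np.arange(len(labels)), labels]))

def coverage_by_class(sets, labels):
    """Coverage computed separately for each true class."""
    labels = np.asarray(labels, dtype=int)
    covered = sets[np.arange(len(labels)), labels]
    return {int(c): float(np.mean(covered[labels == c])) for c in np.unique(labels)}
```

A prediction set containing both classes flags a case the model cannot resolve at the chosen error level, and an empty set flags a case unlike anything seen in calibration. Comparing `coverage` with `coverage_by_class` shows how an overall figure near the target can mask the kind of gap between classes reported above.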

Authors

  • Fatemeh Shah-Mohammadi; Alexander Millar; Julio Facelli; Ramkiran Gouripeddi