Large language models for automatable real-world performance monitoring of diagnostic decision support systems: a comparison to manual doctor panel review in a prospective clinical study
Journal: medRxiv
Published Date: Jan 1, 2025
Abstract
Diagnostic decision support systems (DDSS) are increasingly deployed at scale, yet their diagnostic accuracy is insufficiently monitored once they are integrated into care. Traditional post-market surveillance relies on clinician review, which is costly, slow, and difficult to sustain. Large language models (LLMs) may offer a scalable and potentially automatable solution, but their performance in real-world monitoring remains unknown. We conducted a diagnostic accuracy substudy within ESSENCE, a prospective evaluation of Ada Health’s DDSS integrated into Portugal’s largest private healthcare network. Clinical notes and ICD-10 diagnoses from 498 encounters were anonymised and classified using a filter–map–match framework. Manual clinician review served as the reference standard. We compared eligibility classification and condition mapping between clinicians and GPT-4.1 and GPT-5, and assessed diagnostic accuracy of two DDSS versions using both reference sets. Manual review classified 385 of 498 encounters (77·3%) as eligible for diagnostic comparison. GPT-5 reproduced these classifications with 84·7% accuracy (κ=0·57), showing high sensitivity but only moderate specificity. Among 347 encounters judged eligible by both approaches, GPT-5 exactly matched clinician-assigned diagnoses in 93·6% and proposed clinically plausible alternatives in 3·5%. Diagnostic accuracy estimates based on manual versus GPT-5 mappings were statistically indistinguishable at Top-1 and Top-3 across the full analysable sets, with one significant difference at Top-5. In the overlapping 346 cases, no statistical differences were observed. Across both reference sets, the experimental DDSS version outperformed the original only at the Top-5 threshold. LLMs can reproduce clinician review of real-world diagnostic encounters with close agreement. While GPT-5 performed comparably to clinicians for condition mapping, the eligibility filtering step (deciding which encounters should enter the diagnostic-accuracy analysis) remains the main source of divergence and is the priority for improvement. Embedding such approaches into health systems could enable automated, continuous monitoring of performance and safety and support regulatory compliance. Broader evaluations across diverse care settings are needed to establish generalisability and equity impact.
Funding: German Federal Ministry of Education and Research (NextGenerationEU, PATH project).
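The agreement and accuracy figures reported above (percentage agreement, Cohen's κ, Top-k accuracy) follow standard definitions, which the minimal Python sketch below illustrates. It is not the study's analysis code, which the abstract does not describe. The function names `cohen_kappa` and `top_k_accuracy`, the eligibility labels, and the ICD-10 lists are hypothetical toy inputs chosen for illustration only and do not reproduce any study data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def top_k_accuracy(reference, suggestions, k):
    """Share of encounters whose reference condition is in the top-k list."""
    if len(reference) != len(suggestions) or not reference:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(ref in ranked[:k] for ref, ranked in zip(reference, suggestions))
    return hits / len(reference)


# Hypothetical eligibility decisions (True = eligible) for eight encounters.
clinician_eligible = [True, True, True, True, False, False, True, False]
llm_eligible = [True, True, True, True, True, False, True, False]
kappa = cohen_kappa(clinician_eligible, llm_eligible)

# Hypothetical reference diagnoses and ranked DDSS suggestions (ICD-10 codes).
reference_codes = ["J06.9", "N39.0", "K21.9"]
ddss_ranked = [
    ["J06.9", "J02.9"],
    ["R10.4", "N39.0", "N20.0"],
    ["I20.9", "R07.4"],
]
top1 = top_k_accuracy(reference_codes, ddss_ranked, k=1)
top3 = top_k_accuracy(reference_codes, ddss_ranked, k=3)
print(f"kappa={kappa:.3f} top1={top1:.3f} top3={top3:.3f}")
```

The same `top_k_accuracy` call can be run once with clinician-assigned reference codes and once with LLM-assigned ones, which is the kind of comparison between reference sets that the abstract describes.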