Large language models for automatable real-world performance monitoring of diagnostic decision support systems: a comparison to manual doctor panel review in a prospective clinical study

Journal: medRxiv

Published Date: Jan 1, 2025

Abstract

Diagnostic decision support systems (DDSS) are increasingly deployed at scale, yet their diagnostic accuracy is insufficiently monitored once integrated into care. Traditional post-market surveillance relies on clinician review, which is costly, slow, and difficult to sustain. Large language models (LLMs) may offer a scalable and potentially automatable solution, but their performance in real-world monitoring remains unknown. We conducted a diagnostic accuracy substudy within ESSENCE, a prospective evaluation of Ada Health’s DDSS integrated into Portugal’s largest private healthcare network. Clinical notes and ICD-10 diagnoses from 498 encounters were anonymised and classified using a filter–map–match framework. Manual clinician review served as the reference standard. We compared eligibility classification and condition mapping between clinicians and two LLMs (GPT-4.1 and GPT-5), and assessed the diagnostic accuracy of two DDSS versions using both reference sets. Manual review classified 385 of 498 encounters (77·3%) as eligible for diagnostic comparison. GPT-5 reproduced these classifications with 84·7% accuracy (κ=0·57), showing high sensitivity but only moderate specificity. Among 347 encounters judged eligible by both approaches, GPT-5 exactly matched clinician-assigned diagnoses in 93·6% and proposed clinically plausible alternatives in 3·5%. Diagnostic accuracy estimates based on manual versus GPT-5 mappings were statistically indistinguishable at Top-1 and Top-3 across the full analysable sets, with one significant difference at Top-5. In the overlapping 346 cases, no statistically significant differences were observed. Across both reference sets, the experimental DDSS version outperformed the original only at the Top-5 threshold. LLMs can reproduce clinician review of real-world diagnostic encounters with close agreement. While GPT-5 performed comparably to clinicians for condition mapping, the eligibility filtering step (deciding which encounters should enter the diagnostic-accuracy analysis) remains the main source of divergence and is the priority for improvement. Embedding such approaches into health systems could enable automated, continuous performance and safety monitoring and support regulatory compliance. Broader evaluations across diverse care settings are needed to establish generalisability and equity impact.

Funding: German Federal Ministry of Education and Research (NextGenerationEU, PATH project).
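The eligibility agreement figures reported above (accuracy, κ, sensitivity, specificity) all follow from a 2×2 table of clinician versus GPT-5 decisions. The abstract does not report that table, so the sketch below uses counts inferred from the stated totals, and these are an assumption rather than published data: 347 encounters eligible by both approaches, 385 eligible by manual review (leaving 38 missed by GPT-5), and 84·7% accuracy over 498 encounters (422 agreements, hence 75 encounters excluded by both and 38 included by GPT-5 only). The function name `agreement_metrics` is hypothetical.

```python
def agreement_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Agreement statistics for a binary classifier against a reference standard.

    tp: eligible by both reference and model
    fp: eligible by model only
    fn: eligible by reference only
    tn: ineligible by both
    """
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n                 # observed agreement
    sensitivity = tp / (tp + fn)             # reference-eligible cases retained
    specificity = tn / (tn + fp)             # reference-ineligible cases excluded

    # Agreement expected by chance, from the marginal totals of each rater.
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((tn + fp) / n) * ((tn + fn) / n)
    p_expected = p_yes + p_no

    # Cohen's kappa: agreement beyond chance, scaled to its maximum.
    kappa = (accuracy - p_expected) / (1 - p_expected)

    return {
        "n": n,
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


# ASSUMPTION: cell counts inferred from the abstract's totals, not reported in it.
m = agreement_metrics(tp=347, fp=38, fn=38, tn=75)

for name, value in m.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

Under these inferred counts the function returns an accuracy of about 84·7% and κ of about 0·57, consistent with the reported values. Sensitivity comes out near 90% and specificity near 66%, which fits the description of high sensitivity and only moderate specificity.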

Authors

Fabienne Cotte; Marcel Schmude; Philipp Bode; Oula Suliman; Filipa Dias Lourenço; Miguel Paiva Pereira; Nisha Kini; Vera Hartenstein; Alessandro Muscoloni; Lisa Stroux; Victor Hertz; Sebastian Köhler; Valerio Morelli; Henry Hoffmann; Peter Engerer; Stephen Gilbert; Kirsten Gray; Tauseef Mehrali; Micaela Seemann Monteiro; Pedro Flores

