Automated Evaluation of Large Language Model Response Concordance with Human Specialist Responses on Physician-to-Physician eConsult Cases

Journal: medRxiv
Published Date:

Abstract

Specialist consults in primary care and inpatient settings typically address complex clinical questions beyond standard guidelines. eConsults have been developed as a way for specialist physicians to review cases asynchronously and provide clinical answers without a formal patient encounter. Meanwhile, large language models (LLMs) have approached human-level performance on structured clinical tasks, but their real-world effectiveness requires evaluation, which is bottlenecked by time-intensive manual physician review. To address this, we evaluate two automated methods: LLM-as-judge and a decompose-then-verify framework that breaks AI answers down into individual claims and verifies each against the human eConsult response. Using 40 real-world physician-to-physician eConsults, we compared AI-generated responses to human answers using both physician raters and automated tools. LLM-as-judge outperformed decompose-then-verify, achieving human-level concordance assessment with an F1-score of 0.89 (95% CI: 0.750, 0.960) and a Cohen’s kappa of 0.75 (95% CI: 0.47, 0.90), comparable to physician inter-rater agreement of κ = 0.69-0.90 (95% CI: 0.43-1.0).
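As an illustration of the two agreement metrics reported above, the following minimal sketch computes an F1-score and Cohen's kappa from binary concordance labels (1 = concordant, 0 = not concordant). It is not the study's analysis code: the function names and the ten toy labels are hypothetical, and the confidence intervals quoted in the abstract are not reproduced here.

```python
from typing import Sequence


def f1_score(reference: Sequence[int], predicted: Sequence[int]) -> float:
    """F1 for the positive class (1), treating `reference` as ground truth."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    a, b = list(rater_a), list(rater_b)
    n = len(a)
    # Observed proportion of cases on which the two raters agree.
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    # Agreement expected by chance, from each rater's label frequencies.
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels, NOT data from the study.
physician = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
llm_judge = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]

print(f"F1    = {f1_score(physician, llm_judge):.2f}")
print(f"kappa = {cohen_kappa(physician, llm_judge):.2f}")
```

F1 treats the physician labels as the reference standard, whereas kappa is symmetric between the two raters, which is why it suits both judge-versus-physician and physician-versus-physician comparisons.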

Authors

  • David JH Wu; Fateme Nateghi Haredasht; David Wu; Vishnu Ravi; Liam G. McCoy; Yingjie Weng; Kanav Chopra; Selin S. Everett; George Nageeb; Wenyuan Chen; Stephen P. Ma; Saloni Kumar Maharaj; Jessica Tran; Leah Rosengaus; Lena Giang; Olivia Jee; Ethan Goh; Jonathan H Chen