Automating Evaluation of AI Text Generation in Healthcare with a Large Language Model (LLM)-as-a-Judge.

Journal: medRxiv: the preprint server for health sciences
Published Date:

Abstract

Electronic Health Records (EHRs) store vast amounts of clinical information, making it difficult for healthcare providers to summarize and synthesize the details relevant to their practice. To reduce this cognitive load, generative AI tools built on Large Language Models (LLMs) have emerged to automatically summarize patient records into clear, actionable insights. However, LLM summaries must be precise and free from errors, making evaluation of summary quality necessary. While human experts are the gold standard for evaluation, their involvement is time-consuming and costly. Therefore, we introduce and validate an automated method for evaluating real-world EHR multi-document summaries using an LLM as the evaluator, referred to as LLM-as-a-Judge. Benchmarked against the validated Provider Documentation Summarization Quality Instrument (PDSQI-9) for human evaluation, our LLM-as-a-Judge framework uses the PDSQI-9 rubric and demonstrated strong inter-rater reliability with human evaluators. GPT-o3-mini achieved the highest intraclass correlation coefficient of 0.818 (95% CI 0.772, 0.854), with a median score difference of 0 from human evaluators, completing evaluations in just 22 seconds. Overall, the reasoning models excelled in inter-rater reliability, particularly in evaluations requiring advanced reasoning and domain expertise, outperforming non-reasoning models, models trained on the task, and multi-agent workflows. Cross-task validation on the Problem Summarization task similarly confirmed high reliability. By automating high-quality evaluations, a medical LLM-as-a-Judge offers a scalable, efficient solution for rapidly identifying accurate and safe AI-generated summaries in healthcare settings.
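The abstract reports agreement between the LLM judge and human evaluators as an intraclass correlation coefficient (ICC). As an illustration only, the sketch below computes ICC(2,1), a common two-way random-effects, single-rater form; the abstract does not state which ICC variant the authors used, so both the variant choice and the sample data are assumptions.

```python
def icc2_1(ratings):
    """Two-way random-effects, absolute-agreement, single-rater ICC(2,1).

    ratings: list of rows, one row per rated summary (subject),
             one column per rater (e.g., human evaluators and an LLM judge).
    """
    n = len(ratings)        # number of summaries rated
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Two-way ANOVA decomposition of the total sum of squares.
    ssr = k * sum((m - grand) ** 2 for m in row_means)   # between-subject
    ssc = n * sum((m - grand) ** 2 for m in col_means)   # between-rater
    sst = sum((x - grand) ** 2 for row in ratings for x in row)
    sse = sst - ssr - ssc                                # residual

    msr = ssr / (n - 1)
    msc = ssc / (k - 1)
    mse = sse / ((n - 1) * (k - 1))

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Perfect agreement between two raters yields ICC = 1.0.
print(icc2_1([[3, 3], [4, 4], [5, 5]]))  # -> 1.0
```

Disagreement between raters lowers the coefficient toward 0, which is why the 0.818 value reported for GPT-o3-mini indicates strong, though not perfect, agreement with human evaluators.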

Authors

  • Emma Croxford
    Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI 53792, United States.
  • Yanjun Gao
    Department of Biomedical Informatics, University of Colorado-Anschutz Medical, Aurora, CO 80045, United States.
  • Elliot First
    Epic Systems, Verona, WI 53593, United States.
  • Nicholas Pellegrino
    Epic Systems, Verona, WI 53593, United States.
  • Miranda Schnier
    Epic Systems, Verona, WI 53593, United States.
  • John Caskey
    Department of Medicine, University of Wisconsin, Madison, WI, United States.
  • Madeline Oguss
    Department of Medicine, University of Wisconsin, Madison, WI, United States.
  • Graham Wills
    UW Health, Madison, WI 53726, United States.
  • Guanhua Chen
    Vanderbilt University School of Medicine, Nashville, TN.
  • Dmitriy Dligach
    Department of Public Health Sciences, Stritch School of Medicine, Loyola University Chicago, Maywood, IL.
  • Matthew M Churpek
    Department of Medicine, University of Wisconsin-Madison, Madison, WI, United States.
  • Anoop Mayampurath
    Department of Medicine, University of Wisconsin-Madison, Madison, WI, United States.
  • Frank Liao
    UW Health, Madison, WI 53726, United States.
  • Cherodeep Goswami
    UW Health, Madison, WI 53726, United States.
  • Karen K Wong
    Epic Systems, Verona, WI 53593, United States.
  • Brian W Patterson
    UW Health, Madison, WI 53726, United States.
  • Majid Afshar
    Loyola University Chicago, Chicago, IL.
