Development of a Human Evaluation Framework and Correlation with Automated Metrics for Natural Language Generation of Medical Diagnoses.

Journal: AMIA ... Annual Symposium proceedings. AMIA Symposium

Published Date: May 22, 2025

Abstract

In the evolving landscape of clinical Natural Language Generation (NLG), assessing abstractive text quality remains challenging, as existing methods often overlook generative task complexities. This work aimed to examine the current state of automated evaluation metrics in NLG in healthcare. To have a robust and well-validated baseline with which to examine the alignment of these metrics, we created a comprehensive human evaluation framework. Employing ChatGPT-3.5-turbo generative output, we correlated human judgments with each metric. None of the metrics demonstrated high alignment; however, the SapBERT score, a metric based on the Unified Medical Language System (UMLS), showed the best results. This underscores the importance of incorporating domain-specific knowledge into evaluation efforts. Our work reveals the deficiency in quality evaluations for generated text and introduces our comprehensive human evaluation framework as a baseline. Future efforts should prioritize integrating medical knowledge databases to enhance the alignment of automated metrics, particularly focusing on refining the SapBERT score for improved assessments.
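The correlation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which correlation statistic was used, so Spearman rank correlation is an assumption here, chosen because human ratings are often ordinal. The function names, metric names, and all scores below are hypothetical placeholders, not data or results from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical placeholder values (NOT from the study): one human rating per
# generated diagnosis, and one score per diagnosis from each automated metric.
human_scores = [1, 2, 3, 4, 5]
metric_scores = {
    "metric_a": [0.10, 0.35, 0.30, 0.60, 0.90],
    "metric_b": [0.80, 0.20, 0.50, 0.40, 0.60],
}

for name, scores in metric_scores.items():
    print(f"{name}: rho = {spearman(human_scores, scores):.3f}")
```

A coefficient near 1 would indicate that a metric orders the generated texts much as the human raters do, while a value near 0 would indicate little alignment.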

Authors

Emma Croxford
Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI 53792, United States.

Yanjun Gao
Department of Biomedical Informatics, University of Colorado-Anschutz Medical, Aurora, CO 80045, United States.

Brian Patterson
UW Health, Madison, WI 53726, United States.

Daniel To
Health Sciences Division, Burn and Shock Trauma Research Institute, Stritch School of Medicine, Loyola University, Maywood, Illinois, USA.

Samuel Tesch
School of Medicine and Public Health, University of Wisconsin, Madison, Wisconsin, USA.

Dmitriy Dligach
Department of Public Health Sciences, Stritch School of Medicine, Loyola University Chicago, Maywood, IL.

Anoop Mayampurath
Department of Medicine, University of Wisconsin-Madison, Madison, WI, United States.

Matthew M Churpek
Department of Medicine, University of Wisconsin-Madison, Madison, WI, United States.

Majid Afshar
Loyola University Chicago, Chicago, IL.

Keywords

Electronic Health Records; Humans; Natural Language Processing; Unified Medical Language System

External Resources

PubMed (PMID: 40417585)
