AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation
Journal: arXiv
Published Date: May 17, 2025
Abstract
With the proliferation of large language models (LLMs) in the medical domain,
there is increasing demand for improved evaluation techniques to assess their
capabilities. However, traditional metrics like F1 and ROUGE, which rely on
token overlap to measure quality, largely overlook the importance of
medical terminology. While human evaluation tends to be more reliable, it can
be very costly and may also suffer from inaccuracies due to limits in human
expertise and motivation. Although some LLM-based evaluation methods exist,
their usability in the medical field is limited by their proprietary
nature or lack of domain expertise. To tackle these challenges, we present
AutoMedEval, an open-source automatic evaluation model with 13B parameters
specifically engineered to measure the question-answering proficiency of
medical LLMs. The overarching objective of AutoMedEval is to assess the quality
of responses produced by diverse models, with the aim of significantly reducing
dependence on human evaluation. Specifically, we propose a hierarchical
training method involving curriculum instruction tuning and an iterative
knowledge introspection mechanism, enabling AutoMedEval to acquire professional
medical assessment capabilities with limited instructional data. Human
evaluations indicate that AutoMedEval surpasses other baselines in terms of
correlation with human judgments.
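
To illustrate the abstract's point that token-overlap metrics can overlook medical terminology, here is a minimal sketch of a token-level F1 score. The function name `token_f1` and the example sentences are hypothetical and chosen only for illustration; they are not taken from the paper, and this is not AutoMedEval's scoring method.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 over whitespace tokens (illustrative sketch only)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences: the second answer swaps one clinically critical term.
reference = "the patient should receive insulin for hyperglycemia"
correct_answer = "the patient should receive insulin for hyperglycemia"
wrong_answer = "the patient should receive insulin for hypoglycemia"

score_correct = token_f1(correct_answer, reference)
score_wrong = token_f1(wrong_answer, reference)
print(f"correct: {score_correct:.3f}, wrong: {score_wrong:.3f}")
```

In this sketch the wrong answer shares six of seven tokens with the reference, so it still scores about 0.857 even though the single changed term reverses the clinical meaning.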