A Scalable Framework for Evaluating Health Language Models
Journal: arXiv
Published Date: Mar 30, 2025
Abstract
Large language models (LLMs) have emerged as powerful tools for analyzing
complex datasets. Recent studies demonstrate their potential to generate
useful, personalized responses when provided with patient-specific health
information that encompasses lifestyle, biomarkers, and context. As LLM-driven
health applications are increasingly adopted, rigorous and efficient one-sided
evaluation methodologies are crucial to ensure response quality across multiple
dimensions, including accuracy, personalization and safety. Current evaluation
practices for open-ended text responses rely heavily on human experts. This
approach introduces human factors, is often cost-prohibitive and
labor-intensive, and hinders scalability, especially in complex domains like
healthcare, where response assessment requires domain expertise and must
consider multifaceted patient data. In this work, we introduce Adaptive
Precise Boolean rubrics: an evaluation framework that streamlines human and
automated evaluation of open-ended questions by identifying gaps in model
responses using a minimal set of targeted rubric questions. Our approach is
based on recent work in more general evaluation settings that contrasts a
smaller set of complex evaluation targets with a larger set of more precise,
granular targets answerable with simple boolean responses. We validate this
approach in metabolic health, a domain encompassing diabetes, cardiovascular
disease, and obesity. Our results demonstrate that Adaptive Precise Boolean
rubrics yield higher inter-rater agreement among expert and non-expert human
evaluators, and in automated assessments, compared to traditional Likert
scales, while requiring approximately half the evaluation time of Likert-based
methods. This enhanced efficiency, particularly in automated evaluation and
non-expert contributions, paves the way for more extensive and cost-effective
evaluation of LLMs in health.
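
To make the idea of boolean rubrics concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each rater answers a short list of yes/no rubric questions about one model response, and it measures agreement as simple mean pairwise percent agreement, which is a stand-in chosen for illustration; the abstract does not state which agreement statistic the paper uses. The rubric questions, rater names, and function names are all hypothetical.

```python
from itertools import combinations

# Hypothetical rubric questions; the paper's actual rubrics are not
# given in the abstract.
RUBRIC = [
    "Does the response reference the user's biomarker values?",
    "Does the response avoid recommending unsafe actions?",
    "Does the response address the question that was asked?",
]


def pairwise_agreement(ratings):
    """Mean fraction of rubric items on which each pair of raters agrees.

    `ratings` maps a rater name to a list of booleans, one per rubric item.
    This is plain percent agreement, used here only as an illustration.
    """
    pairs = list(combinations(ratings.values(), 2))
    if not pairs:
        raise ValueError("need at least two raters")
    per_pair = [
        sum(a == b for a, b in zip(r1, r2)) / len(r1)
        for r1, r2 in pairs
    ]
    return sum(per_pair) / len(per_pair)


def failed_items(answers):
    """Rubric questions answered False, i.e. the gaps flagged in a response."""
    return [question for question, ok in zip(RUBRIC, answers) if not ok]


# Made-up ratings for a single model response.
ratings = {
    "rater_a": [True, True, False],
    "rater_b": [True, True, True],
    "rater_c": [True, False, False],
}

agreement = pairwise_agreement(ratings)
gaps = failed_items(ratings["rater_a"])
```

Because each item is a yes/no judgment, disagreements can be traced to a specific question rather than to a shift of one or two points on a Likert scale, which is the intuition behind the agreement comparison described above.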