Dissecting HealthBench: Disease Spectrum, Clinical Diversity, and Data Insights from Multi-Turn Clinical AI Evaluation Benchmark.
Journal: Journal of Medical Systems
Published Date: Jul 28, 2025
Abstract
HealthBench is an open-source, large-scale benchmark consisting of 5,000 multi-turn clinical conversations evaluated against 48,562 criteria developed by clinicians. Recognized as a significant advance in the realistic assessment of artificial intelligence (AI) models, HealthBench deserves closer examination. In this article, we systematically analyze the benchmark's disease spectrum, diagnostic and therapeutic focuses, and demographic diversity. We evaluate its representativeness and strengths, as well as the key limitations that AI researchers and clinicians should consider when using it for realistic model evaluations.