MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

Journal: arXiv

Published Date: May 26, 2025

Abstract

While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance on medical tasks, with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks, developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) that provides complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open-source framework to enable this.
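The abstract reports win-rates without defining them. As a rough illustration only, the sketch below assumes a win-rate is the fraction of head-to-head comparisons (one per benchmark and per opposing model) in which a model scores higher, with ties counted as half a win. The function name `mean_win_rate`, the tie rule, and the toy scores are all hypothetical and are not taken from the paper or its codebase.

```python
def mean_win_rate(scores):
    """Compute a pairwise win-rate per model.

    scores: dict mapping model name -> dict mapping benchmark name -> score,
            where a higher score is better.
    Assumption (not from the paper): a tie counts as half a win.
    """
    models = list(scores)
    wins = {m: 0.0 for m in models}
    comparisons = {m: 0 for m in models}

    for a in models:
        for b in models:
            if a == b:
                continue
            # Only compare on benchmarks both models were scored on.
            for bench in scores[a].keys() & scores[b].keys():
                comparisons[a] += 1
                if scores[a][bench] > scores[b][bench]:
                    wins[a] += 1.0
                elif scores[a][bench] == scores[b][bench]:
                    wins[a] += 0.5

    return {
        m: (wins[m] / comparisons[m]) if comparisons[m] else float("nan")
        for m in models
    }


# Hypothetical toy scores, invented purely for illustration.
toy_scores = {
    "model_a": {"bench_1": 0.8, "bench_2": 0.6},
    "model_b": {"bench_1": 0.7, "bench_2": 0.9},
    "model_c": {"bench_1": 0.5, "bench_2": 0.6},
}

if __name__ == "__main__":
    for model, rate in mean_win_rate(toy_scores).items():
        print(f"{model}: {rate:.3f}")
```

Under this definition the win-rates of all models average to 0.5, so a reported value such as 66% would mean a model wins roughly two thirds of its head-to-head comparisons.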

Authors

Suhana Bedi
Hejie Cui
Miguel Fuentes
Alyssa Unell
Michael Wornow
Juan M. Banda
Nikesh Kotecha
Timothy Keyes
Yifan Mai
Mert Oez
Hao Qiu
Shrey Jain
Leonardo Schettini
Mehr Kashyap
Jason Alan Fries
Akshay Swaminathan
Philip Chung
Fateme Nateghi
Asad Aali
Ashwin Nayak
Shivam Vedak
Sneha S. Jain
Birju Patel
Oluseyi Fayanju
Shreya Shah
Ethan Goh
Dong-han Yao
Brian Soetikno
Eduardo Reis
Sergios Gatidis
Vasu Divi
Robson Capasso
Rachna Saralkar
Chia-Chun Chiang
Jenelle Jindal
Tho Pham
Faraz Ghoddusi
Steven Lin
Albert S. Chiou
Christy Hong
Mohana Roy
Michael F. Gensheimer
Hinesh Patel
Kevin Schulman
Dev Dash
Danton Char
Lance Downing
Francois Grolleau
Kameron Black
Bethel Mieso
Aydin Zahedivash
Wen-wai Yim
Harshita Sharma
Tony Lee
Hannah Kirsch
Jennifer Lee
Nerissa Ambers
Carlene Lugtu
Aditya Sharma
Bilal Mawji
Alex Alekseyev
Vicky Zhou
Vikas Kakkar
Jarrod Helzer
Anurang Revri
Yair Bannett
Roxana Daneshjou
Jonathan Chen
Emily Alsentzer
Keith Morse
Nirmal Ravi
Nima Aghaeepour
Vanessa Kennedy
Akshay Chaudhari
Thomas Wang
Sanmi Koyejo
Matthew P. Lungren
Eric Horvitz
Percy Liang
Mike Pfeffer
Nigam H. Shah

External Resources

View on arXiv (http://arxiv.org/abs/2505.23802v2)
