MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

Journal: arXiv
Published Date:

Abstract

While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance on medical tasks, with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks, developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) that provides complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs on the 35 benchmarks revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open-source framework to enable it.
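
The abstract reports win-rates without defining them. As a rough illustration only, the sketch below assumes a common head-to-head definition: a model's win-rate is the fraction of (other model, benchmark) comparisons in which it scores higher, with ties counted as half a win. The function name, the model and benchmark names, and the toy scores are all hypothetical and are not taken from MedHELM; the paper's exact aggregation may differ.

```python
def pairwise_win_rates(scores):
    """Return {model: win rate in [0, 1]} from {model: {benchmark: score}}.

    Assumed definition (not confirmed by the abstract): the share of
    head-to-head comparisons, over every other model and every shared
    benchmark, that the model wins. Ties count as half a win.
    """
    rates = {}
    for model, model_scores in scores.items():
        wins = 0.0
        comparisons = 0
        for other, other_scores in scores.items():
            if other == model:
                continue
            # Only compare on benchmarks that both models were scored on.
            for bench in model_scores.keys() & other_scores.keys():
                comparisons += 1
                if model_scores[bench] > other_scores[bench]:
                    wins += 1.0
                elif model_scores[bench] == other_scores[bench]:
                    wins += 0.5
        rates[model] = wins / comparisons if comparisons else float("nan")
    return rates


# Hypothetical toy scores on a normalized 0-1 scale.
toy_scores = {
    "model_a": {"bench_1": 0.80, "bench_2": 0.60},
    "model_b": {"bench_1": 0.70, "bench_2": 0.65},
    "model_c": {"bench_1": 0.60, "bench_2": 0.50},
}
rates = pairwise_win_rates(toy_scores)
print(rates)
```

In this toy example model_a and model_b each win three of their four comparisons and model_c wins none, which shows how two models can tie on win-rate even though their per-benchmark scores differ.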

Authors

  • Suhana Bedi
  • Hejie Cui
  • Miguel Fuentes
  • Alyssa Unell
  • Michael Wornow
  • Juan M. Banda
  • Nikesh Kotecha
  • Timothy Keyes
  • Yifan Mai
  • Mert Oez
  • Hao Qiu
  • Shrey Jain
  • Leonardo Schettini
  • Mehr Kashyap
  • Jason Alan Fries
  • Akshay Swaminathan
  • Philip Chung
  • Fateme Nateghi
  • Asad Aali
  • Ashwin Nayak
  • Shivam Vedak
  • Sneha S. Jain
  • Birju Patel
  • Oluseyi Fayanju
  • Shreya Shah
  • Ethan Goh
  • Dong-han Yao
  • Brian Soetikno
  • Eduardo Reis
  • Sergios Gatidis
  • Vasu Divi
  • Robson Capasso
  • Rachna Saralkar
  • Chia-Chun Chiang
  • Jenelle Jindal
  • Tho Pham
  • Faraz Ghoddusi
  • Steven Lin
  • Albert S. Chiou
  • Christy Hong
  • Mohana Roy
  • Michael F. Gensheimer
  • Hinesh Patel
  • Kevin Schulman
  • Dev Dash
  • Danton Char
  • Lance Downing
  • Francois Grolleau
  • Kameron Black
  • Bethel Mieso
  • Aydin Zahedivash
  • Wen-wai Yim
  • Harshita Sharma
  • Tony Lee
  • Hannah Kirsch
  • Jennifer Lee
  • Nerissa Ambers
  • Carlene Lugtu
  • Aditya Sharma
  • Bilal Mawji
  • Alex Alekseyev
  • Vicky Zhou
  • Vikas Kakkar
  • Jarrod Helzer
  • Anurang Revri
  • Yair Bannett
  • Roxana Daneshjou
  • Jonathan Chen
  • Emily Alsentzer
  • Keith Morse
  • Nirmal Ravi
  • Nima Aghaeepour
  • Vanessa Kennedy
  • Akshay Chaudhari
  • Thomas Wang
  • Sanmi Koyejo
  • Matthew P. Lungren
  • Eric Horvitz
  • Percy Liang
  • Mike Pfeffer
  • Nigam H. Shah