Aggregate benchmark scores obscure patient safety implications of errors across frontier language models

Journal: medRxiv

Published Date: Mar 20, 2026

Abstract

Frontier language models are widely used for health-related queries, yet aggregate benchmark scores do not capture the safety implications of their errors. We applied the recent Nature Medicine triage benchmark to nine frontier models, comparing directional error profiles, contextual bias, and crisis calibration. In-range accuracy ranged from 75.0% to 87.7%, a spread that obscured clinically meaningful differences in error type. Under-triage ranged from 0.0% (GPT-5.2) to 12.3% (GPT-5-mini), over-triage varied independently (9.4-36.9%), and under-triage was uncorrelated with aggregate accuracy. When family members minimized symptoms, all models tested shifted toward lower acuity in ambiguous cases (OR range 2.9-14.9); this was the only contextual effect observed consistently. Access barriers increased under-triage risk in six models. Suicide crisis resource mention rates were low and variable across all models. This cross-model heterogeneity, together with non-monotonic performance across model generations, shows that aggregate accuracy alone cannot characterize, rank, or predict the clinical safety of deployed language models.
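
To illustrate the distinction between aggregate accuracy and error direction, the following is a minimal Python sketch, not the authors' code. It assumes each case carries a predicted acuity level and an acceptable gold-standard range, both coded as integers where a larger number means more urgent care. The function name `directional_error_rates`, the coding scheme, and the toy data are all hypothetical.

```python
from collections import Counter


def directional_error_rates(cases):
    """Split triage outcomes into in-range, under-triage, and over-triage.

    `cases` is an iterable of (predicted, gold_low, gold_high) integer
    acuity levels. Assumed coding: larger number = more urgent care.
    """
    counts = Counter()
    for predicted, gold_low, gold_high in cases:
        if predicted < gold_low:
            # Model recommended less urgent care than acceptable.
            counts["under_triage"] += 1
        elif predicted > gold_high:
            # Model recommended more urgent care than acceptable.
            counts["over_triage"] += 1
        else:
            counts["in_range"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no cases supplied")
    keys = ("in_range", "under_triage", "over_triage")
    return {key: counts[key] / total for key in keys}


# Made-up toy data: two hypothetical models with identical in-range accuracy.
model_a = [(2, 2, 3), (3, 2, 3), (4, 2, 3), (4, 3, 3)]  # errors are over-triage
model_b = [(2, 2, 3), (3, 2, 3), (1, 2, 3), (2, 3, 3)]  # errors are under-triage

print(directional_error_rates(model_a))
print(directional_error_rates(model_b))
```

In this toy data both models score the same in-range accuracy, yet all of one model's errors are over-triage and all of the other's are under-triage, which is the kind of difference a single aggregate score would hide.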

Authors

Linzmayer R.; Ramaswamy A.; Hugo H.; Nadkarni G.; Elhadad N.
