Aggregate benchmark scores obscure patient safety implications of errors across frontier language models
Journal: medRxiv
Published Date: Mar 20, 2026
Abstract
Frontier language models are widely used for health-related queries, yet aggregate benchmark scores do not capture the safety implications of their errors. We applied the recent Nature Medicine triage benchmark to nine frontier models, comparing directional error profiles, contextual bias, and crisis calibration. In-range accuracy ranged from 75.0% to 87.7%, a spread that obscured clinically meaningful differences in error type. Under-triage ranged from 0.0% (GPT-5.2) to 12.3% (GPT-5-mini), over-triage varied independently (9.4% to 36.9%), and under-triage was uncorrelated with aggregate accuracy. When family members minimized symptoms, all models tested shifted toward lower acuity in ambiguous cases (OR range 2.9 to 14.9); this was the only contextual effect observed consistently. Access barriers increased under-triage risk in six models. Suicide crisis resource mention rates were low and variable across all models. This cross-model heterogeneity, together with non-monotonic performance across model generations, shows that aggregate accuracy alone cannot characterize, rank, or predict the clinical safety of deployed language models.
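
The distinction between aggregate accuracy and directional error can be illustrated with a minimal sketch. It is not the benchmark's scoring code. It assumes a hypothetical ordinal acuity scale (higher means more urgent) and a hypothetical acceptable range per case; the names `TriageCase` and `directional_error_profile` and the toy cases are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageCase:
    # All fields use a hypothetical ordinal acuity scale (higher = more urgent).
    predicted: int  # acuity level recommended by the model
    low: int        # lowest acceptable acuity level for this case
    high: int       # highest acceptable acuity level for this case


def directional_error_profile(cases):
    """Split triage outcomes into in-range, under-triage, and over-triage rates."""
    if not cases:
        raise ValueError("at least one case is required")
    n = len(cases)
    under = sum(c.predicted < c.low for c in cases)   # recommended too little care
    over = sum(c.predicted > c.high for c in cases)   # recommended too much care
    in_range = n - under - over
    return {
        "in_range": in_range / n,
        "under_triage": under / n,
        "over_triage": over / n,
    }


# Toy data (illustrative only): both models make exactly one error in four cases,
# but model A errs toward caution and model B errs toward under-triage.
model_a = [TriageCase(2, 2, 3), TriageCase(4, 2, 3), TriageCase(3, 3, 3), TriageCase(4, 3, 4)]
model_b = [TriageCase(2, 2, 3), TriageCase(1, 2, 3), TriageCase(3, 3, 3), TriageCase(4, 3, 4)]

profile_a = directional_error_profile(model_a)
profile_b = directional_error_profile(model_b)
print(profile_a)
print(profile_b)
```

In this toy example both models have the same in-range accuracy, yet only model B under-triages, which is the kind of difference a single aggregate score would hide.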