Beyond Accuracy: An Efficiency- and Safety-Aware Framework for Evaluating Clinical AI with Large Language Models

Journal: medRxiv
Published Date:

Abstract

Large language models (LLMs) demonstrate strong performance on medical reasoning tasks, but current evaluation approaches focus primarily on accuracy, neglecting the efficiency–safety trade-offs critical for real-world clinical utility. We developed and validated the Clinical Value Density (CVD) framework, a novel metric quantifying clinical utility per unit of cognitive resource consumed. Six state-of-the-art LLMs (GPT-4o, Gemini-2.5, Claude-Sonnet-4, Grok-3, DeepSeek-R1, and Kimi-K2) were evaluated across 60 authentic clinical pharmacology scenarios derived from UAE healthcare practice and benchmarked against board-certified clinical pharmacist responses. Performance was assessed across efficiency, semantic similarity, safety, relevance, consistency, and conciseness, with triangulated validation against clinician preference and task efficiency. Traditional metrics (BLEU: 0.003–0.024; ROUGE-L: 0.079–0.166) failed to reflect clinical utility, while the CVD framework exposed critical efficiency–safety trade-offs. GPT-4o achieved the highest normalized CVD (0.475), delivering fourfold efficiency gains over pharmacists (41 vs 178 tokens) but with moderate safety (0.437), requiring supervised deployment. Grok-3 and Gemini-2.5 led in safety (0.605 and 0.542) but at the cost of efficiency. DeepSeek-R1 and Kimi-K2 produced unsafe brevity–accuracy trade-offs, generating concise but clinically inaccurate responses. Comparative validation revealed strong alignment between CVD and task efficiency (r = 0.924), but divergence from clinician preference (r = –0.845), reflecting a bias toward verbose outputs. Current LLMs cannot reliably perform clinical functions autonomously. Instead, they are best positioned for AI-assisted, supervised integration, where efficiency and safety are balanced under professional oversight. 
The CVD framework provides a potentially useful tool for regulatory evaluation and for quantifying deployment readiness, aligning AI evaluation with the real-world constraints of cognitive load, time, and patient safety. Future research should extend CVD across specialties, scale validation datasets, and conduct real-time workflow trials to establish specialty-specific safety thresholds for eventual autonomy.

Large language models (LLMs) are impressively accurate on medical reasoning benchmarks, yet clinical work is not a quiz; it is a race against time under strict safety constraints. We introduce Clinical Value Density (CVD), a simple yet powerful metric that captures clinical utility per unit of cognitive cost (i.e., how much useful care an answer delivers for the effort it imposes on clinicians). In a head-to-head evaluation of six state-of-the-art LLMs across 60 authentic clinical pharmacology scenarios benchmarked against board-certified pharmacists, traditional metrics remained uniformly low (BLEU: 0.003–0.024; ROUGE-L: 0.079–0.166), while CVD exposed the true efficiency–safety trade-offs that determine bedside usefulness. For example, GPT-4o delivered the highest normalized CVD (0.475) with ∼4× fewer tokens than pharmacists (≈41 vs 178), but only moderate safety (0.437), which is appropriate for supervised use, not autonomy. Conversely, Grok-3 and Gemini-2.5 were safest but less efficient; DeepSeek-R1 and Kimi-K2 risked unsafe brevity–accuracy trade-offs. Crucially, CVD strongly tracked task efficiency (r = 0.924) yet diverged from clinician preference (r = –0.845), revealing a bias toward overly verbose answers that feel reassuring but slow care.
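
The abstract describes CVD as clinical utility per unit of cognitive cost but does not give the formula. The sketch below is only an illustration of that "value per token" idea, not the authors' definition: the `Response` class, the single `utility` score, and the `value_density` and `token_ratio` helpers are all hypothetical names introduced here. The only figures taken from the abstract are the token counts (178 for pharmacists, about 41 for GPT-4o).

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One answer to a clinical scenario (all fields are illustrative)."""
    utility: float  # hypothetical composite quality score in [0, 1]
    tokens: int     # length of the answer in tokens


def value_density(resp: Response) -> float:
    """Utility delivered per token. An assumption, NOT the paper's CVD formula."""
    if resp.tokens <= 0:
        raise ValueError("tokens must be positive")
    return resp.utility / resp.tokens


def token_ratio(reference_tokens: int, model_tokens: int) -> float:
    """How many times longer the reference answer is than the model answer."""
    return reference_tokens / model_tokens


# Token counts reported in the abstract: pharmacists 178, GPT-4o about 41.
ratio = token_ratio(178, 41)
print(f"length ratio: {ratio:.2f}x")

# Two hypothetical answers with the same utility score but different lengths:
# under a per-token measure, the shorter one has the higher density.
short = Response(utility=0.5, tokens=41)
long = Response(utility=0.5, tokens=178)
print(value_density(short) > value_density(long))
```

The length ratio works out to roughly 4.3, consistent with the "fourfold" figure in the abstract. The second comparison shows why a density-style metric rewards concise answers only when quality is held constant, which is why the abstract reports safety alongside it: a short answer with a low safety score is not a good trade.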

Authors

  • Nazar Zaki; Amal Akor; Salahdein Aburuz; Sham ZainAlAbdin