Rx-LLM: a benchmarking suite to evaluate safe large language model performance for medication-related tasks
Journal: medRxiv
Published Date: Jan 1, 2025
Abstract
For large language models (LLMs) to reach their potential as information technology tools that make medication use safer, clinically relevant benchmarks capable of automated grading and designed specifically to measure the performance of LLMs for medication tasks are required. The purpose of this study was to design a suite of benchmarking tests reflective of Comprehensive Medication Management (CMM; the standard of care for medication optimization) and quantify the baseline performance of the latest LLMs. We established six benchmarks representing critical stages of the CMM process: drug formulation matching, drug order (sig) generation, drug route matching, drug-drug interaction identification, renal dose identification, and drug-indication matching. For each benchmark, we curated a clinician-annotated dataset comprising 250 standardized input-output pairs including both inpatient and outpatient medications. We evaluated the clinical knowledge retrieval capabilities of three LLMs: GPT-4o-mini, MedGemma-27B, and LLaMA3-70B. We employed a zero-shot prompting strategy, excluding in-context examples, to assess the models’ internal clinical knowledge rather than their few-shot learning potential. To assess reliability, each model was run three times at a temperature of 0.7 (a mid-range value of the setting that controls randomness in text generation). Performance was assessed using task-specific evaluation metrics including precision (positive predictive value), recall (sensitivity), F1-score, accuracy, and correctness consistency across trials. Across the six benchmarks, LLaMA3-70B demonstrated the highest performance in four tasks: drug-formulation matching (F1, 54.0% [95% CI: 50.1-58]), drug-order generation (accuracy, 88.0%), drug-route identification (F1, 74.3% [95% CI: 71-78]), and drug-indication identification (accuracy, 97.6% [95% CI: 95.6-99.2]). In the drug–drug interaction task, GPT-4o-mini achieved the highest overall accuracy (70.4% [95% CI: 64.8-75.7]).
For renal dose–adjustment identification, GPT-4o-mini demonstrated the highest F1 score (83.3% [95% CI: 77.6-88]). Correctness-consistency scores ranged from 8.0% to 97.6% across benchmarks, with no model exhibiting uniformly superior consistency. Model performance varied substantially across medication-related tasks. LLaMA3-70B demonstrated promising baseline performance in tasks involving formulation, ordering, route, and indication. GPT-4o-mini showed potential advantages in drug–drug interaction detection and renal dose adjustment. These findings underscore the need for task-specific evaluation when deploying models for medication-focused clinical decision support.
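
The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' grading code. The function names, the micro-averaging over set-valued answers, and the toy data are all assumptions made for illustration. In particular, the abstract does not define "correctness consistency"; the sketch assumes it means the share of items answered correctly in every one of the repeated runs.

```python
from typing import Hashable, Sequence


def precision_recall_f1(
    gold: Sequence[set], pred: Sequence[set]
) -> tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 for set-valued answers
    (e.g. all valid formulations of a drug). Assumed scoring scheme."""
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Fraction of single-label items answered exactly right."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def correctness_consistency(
    gold: Sequence[Hashable], trials: Sequence[Sequence[Hashable]]
) -> float:
    """ASSUMED definition: fraction of items correct in every trial."""
    always_right = sum(
        all(trial[i] == g for trial in trials) for i, g in enumerate(gold)
    )
    return always_right / len(gold)


# Toy data, invented for illustration only (not from the study).
gold_sets = [{"tablet", "capsule"}, {"solution"}]
pred_sets = [{"tablet"}, {"solution", "cream"}]
p, r, f1 = precision_recall_f1(gold_sets, pred_sets)

gold_labels = ["yes", "no", "yes", "no"]
three_runs = [
    ["yes", "no", "no", "no"],
    ["yes", "no", "yes", "no"],
    ["yes", "yes", "yes", "no"],
]
acc_run1 = accuracy(gold_labels, three_runs[0])
consistency = correctness_consistency(gold_labels, three_runs)
```

Under this assumed definition, a model can score well on accuracy in each individual run yet poorly on consistency if its errors fall on different items from run to run, which is one way the wide range of consistency scores reported above could arise.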