Silent numerical failures in large language model-generated pharmacokinetic simulation code: a benchmark against target-controlled infusion validation criteria using the Marsh propofol model
Journal:
medRxiv
Published Date:
Apr 28, 2026
Abstract
Background. Large language models (LLMs) are increasingly used by clinicians to generate executable code for pharmacokinetic (PK) simulation. Whether such code meets the accuracy standards of target-controlled infusion systems has not been systematically evaluated. Methods. Five LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok) were prompted to generate Python code for the Marsh three-compartment propofol model under a standardized 120-minute bolus-plus-infusion regimen. Each LLM was tested in two phases: Phase 1, integrator free; Phase 2, fourth-order Runge-Kutta with 1-second step size mandated. Twenty runs per LLM per phase were collected (n = 200). Plasma concentrations were compared against a triple-validated reference using median prediction error (MDPE), median absolute prediction error (MDAPE), and Wobble. Runs were classified as Class A (MDAPE < 1 %), B (1-30 %), C ([≥] 30 %), or D (failed). Results. All 200 scripts were invokable and created a CSV file; 199/200 (99.5 %, 95 % CI 97.1-99.9 %) produced a valid time-concentration series. The remaining script (Gemini Phase 2 run 18) aborted during row formatting with ValueError and left a header-only CSV. Median MDAPE per LLM x phase ranged 0.0043-0.020 %, with 195/200 runs (97.5 %, 95 % CI 94.3-98.9 %) achieving Class A. Five runs (2.5 %, 95 % CI 1.1-5.7 %) were non-excellent or structurally defective: three were Class C due to time-scale/unit-handling errors (one DeepSeek run with a 6-second effective Euler step from a minute-as-second declaration, two Grok runs with min-1 rate constants applied per second), one was Class D (the empty-CSV failure above), and one was Class B but reflected a duplicated-bolus implementation error rather than a benign numerical deviation. Kruskal-Wallis testing showed significant inter-LLM heterogeneity across all metrics and phases (all omnibus p < 0.01). Strict compliance with Phase 2 directives was 98 % (98/100 runs; 95 % CI 93.1-99.5 %); lenient compliance accepting RK4-adaptive implementations as a superset was 100 % (100/100 runs; 95 % CI 96.3-100 %). Yet all three numerically divergent Phase 2 runs occurred under nominally compliant RK4/dt = 1 s configurations; the fourth non-Class-A Phase 2 outcome was a formatting failure that produced no usable trajectory. Conclusion. LLMs generate numerically accurate Marsh-model code in most runs but silently diverge in a clinically non-negligible minority. The Marsh model -- the simplest fixed-parameter three-compartment propofol model -- functioned here as a positive control: even so, three distinct classes of structural bug (unit/time-scale mismatch, duplicated-bolus event handling, malformed f-string formatting) slipped past apparent execution success. Two additional Phase 2 runs used an RK4-adaptive variant rather than classical RK4 and are therefore better interpreted as strict prompt non-compliance than as numerical failure. Prompt-level method specification substantially reduced algorithm-selection errors but did not eliminate unit or structural bugs. LLM-generated pharmacokinetic code requires reference-based validation before any safety-relevant use.