Algorithmic Versus Expert Rankings of Large Language Models in Peritoneal Dialysis Prescription Review: A Trap-Embedded Synthetic Benchmark
Journal:
medRxiv
Published Date:
Jun 1, 2026
Abstract
Background: Clinical LLM benchmarks rarely test whether algorithmic rankings agree with expert clinical judgment. We developed a trap-embedded peritoneal dialysis (PD) benchmark comparing multiple scoring constructs with blinded nephrologist ratings. Methods: We generated 125 synthetic PD cases containing 13 ISPD-aligned trap types. Five LLMs (Claude Sonnet 4.5, GPT-5.4, Gemini 3.1 Pro, DeepSeek-R1, Grok 4.1 Fast) evaluated each case three times at temperature 0 (1,875 calls). Primary outcome was must-identify TDR_must, analyzed with GEE and case-clustered bootstrap. Secondary analyses included a verbosity-sensitive alarm-burden proxy, WCS, relaxed-match scoring, WCS sensitivity analyses, and a 25-output blinded expert adequacy substudy. Must-identify kappa was 0.89 in Stage 1 and 0.92 in Stage 2. Results: Rankings were discordant. Recall ranked Claude (0.977) and GPT-5.4 (0.955) above the other models (0.86-0.90, p<0.0001). The alarm-burden proxy favored concise models (Grok 0.689; 21.6 vs 2.4 issues/case), while WCS produced a third ordering. In the expert substudy, inter-rater concordance was strong (rho 0.977), but WCS did not show a positive association with expert adequacy (rho -0.17, p=0.41). Conclusion: Clinical LLM rankings in PD prescription review depend strongly on scoring construct. Algorithmic metrics should be reported alongside blinded expert adequacy ratings and should not alone determine deployment.