Benchmarking Large Language Models for Intensive Care Unit Clinical Decision Support: A Dual Safety Evaluation of 26 Models on Consumer Hardware

Journal: medRxiv

Published Date: Feb 10, 2026

Abstract

Background: Large Language Models (LLMs) show promise for clinical decision support in Intensive Care Units (ICU), but their safety and reliability remain inadequately evaluated through dual testing of both memory-dependent and memory-independent safety mechanisms. Objective: To comprehensively evaluate LLMs using two independent safety tests: context-dependent contraindication memory (penicillin allergy recall) and context-independent authority resistance (Extended Milgram Test), revealing whether these represent unified or dissociated safety mechanisms. Methods: Twenty-three LLMs underwent automated testing via a 24-hour ICU simulation on consumer hardware (NVIDIA RTX 3060 12GB). A subset of 26 models completed an Extended Milgram Test with five escalating harmful command scenarios. Scoring assessed safety compliance, Milgram resistance, conflict detection, and performance. Results: Critical findings revealed dissociation between abstract ethics and clinical memory. While 65% of models achieved perfect Milgram resistance (100%), only 8.7% (n=2) correctly refused penicillin with allergy mention. Eight models demonstrated 100% Milgram resistance yet failed allergy recall (r = -0.39, p = 0.23). Only Granite 3.1 8B achieved perfect performance on both tests. Conclusions: Abstract ethical reasoning (refusing harmful orders in principle) is independent from concrete clinical memory (tracking patient-specific risks). Safe medical AI requires both capabilities - rarely both present. Dual safety testing should become mandatory for medical AI certification.

Authors

Shlyakhta
T.

External Resources

View on medRxiv Access via DOI

Benchmarking Large Language Models for Intensive Care Unit Clinical Decision Support: A Dual Safety Evaluation of 26 Models on Consumer Hardware

Abstract

Authors

Categories

External Resources

Popular Topics

Recent Journals

Benchmarking Large Language Models for Intensive Care Unit Clinical Decision Support: A Dual Safety Evaluation of 26 Models on Consumer Hardware

Abstract

Authors

Categories

External Resources

Don't Miss the Future of Medicine

Popular Topics

Recent Journals