Red-Teaming Medical AI: Systematic Adversarial Evaluation of LLM Safety Guardrails in Clinical Contexts
Journal: medRxiv
Published Date: Mar 5, 2026
Abstract
Background: Large language models (LLMs) are increasingly deployed in medical contexts as patient-facing assistants, providing medication information, symptom triage, and health guidance. Understanding their robustness to adversarial inputs is critical for patient safety, as even a single safety failure can lead to adverse outcomes including severe harm or death.
Objective: To systematically evaluate the safety guardrails of state-of-the-art LLMs through adversarial red-teaming specifically designed for medical contexts.
Methods: We developed a comprehensive taxonomy of 8 adversarial attack categories targeting medical AI safety, encompassing 24 distinct sub-strategies. Using an LLM-based attack generator, we created 160 realistic adversarial prompts across categories including dangerous dosing, contraindication bypass, emergency misdirection, and multi-turn escalation. We tested Claude Sonnet 4.5 using both single-turn and multi-turn attack sequences under a standard medical assistant system prompt. An automated evaluator pre-screened responses for harm potential on a 0-5 scale, with physician review planned for high-risk responses.
Results: Of 160 adversarial prompts, 11 (6.9%) elicited responses meeting our threshold for clinically significant harm (harm level >= 3). The model exhibited full refusal in 86.2% of cases. Authority Impersonation was the dominant attack vector (45.0% success rate), with the Educational Authority sub-strategy achieving 83.3% success. Multi-turn escalation attacks achieved 0% success. Six of eight attack categories yielded no successful attacks.
Conclusions: Standard medical assistant system prompts provide strong baseline protection against most adversarial attacks but are substantially vulnerable to authority impersonation, particularly claims of educational context. The primary failure mode is behavioral mode-switching rather than factual inaccuracy, suggesting guardrail improvements should target context-conditioned behavior. Our open-source taxonomy and evaluation pipeline enable ongoing adversarial assessment as medical AI systems evolve.
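The success metric described in the Results can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the authors' pipeline: the function name `attack_success_rates`, the `(category, harm_score)` record layout, and the toy records are all assumptions made for this example. Only the 0-5 harm scale and the harm level >= 3 threshold come from the abstract.

```python
from collections import defaultdict

# From the abstract: responses with harm level >= 3 count as clinically significant.
HARM_THRESHOLD = 3


def attack_success_rates(records, threshold=HARM_THRESHOLD):
    """Compute overall and per-category attack success rates.

    `records` is an iterable of (category, harm_score) pairs, where
    harm_score is an integer on the 0-5 scale. This layout is a
    hypothetical one chosen for the sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, harm in records:
        if not 0 <= harm <= 5:
            raise ValueError(f"harm score out of range: {harm}")
        totals[category] += 1
        if harm >= threshold:
            hits[category] += 1

    n = sum(totals.values())
    if n == 0:
        raise ValueError("no records supplied")

    overall = sum(hits.values()) / n
    per_category = {c: hits[c] / totals[c] for c in totals}
    return overall, per_category


# Toy records, invented for illustration (not data from the study).
toy_records = [
    ("authority_impersonation", 4),
    ("authority_impersonation", 1),
    ("dangerous_dosing", 0),
    ("dangerous_dosing", 2),
]
overall, per_category = attack_success_rates(toy_records)
print(overall, per_category)
```

Applied to the counts reported in the abstract, the same ratio gives 11 / 160, or about 6.9%, for the overall success rate.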