CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs
Journal:
arXiv
Published Date:
May 16, 2025
Abstract
Large language models (LLMs) are increasingly deployed in medical contexts,
raising critical concerns about safety, alignment, and susceptibility to
adversarial manipulation. While prior benchmarks assess model refusal
capabilities for harmful prompts, they often lack clinical specificity, graded
harmfulness levels, and coverage of jailbreak-style attacks. We introduce CARES
(Clinical Adversarial Robustness and Evaluation of Safety), a benchmark for
evaluating LLM safety in healthcare. CARES includes over 18,000 prompts
spanning eight medical safety principles, four harm levels, and four prompting
styles: direct, indirect, obfuscated, and role-play, to simulate both malicious
and benign use cases. We propose a three-way response evaluation protocol
(Accept, Caution, Refuse) and a fine-grained Safety Score metric to assess
model behavior. Our analysis reveals that many state-of-the-art LLMs remain
vulnerable to jailbreaks that subtly rephrase harmful prompts, while also
over-refusing safe but atypically phrased queries. Finally, we propose a
mitigation strategy using a lightweight classifier to detect jailbreak attempts
and steer models toward safer behavior via reminder-based conditioning. CARES
provides a rigorous framework for testing and improving medical LLM safety
under adversarial and ambiguous conditions.