Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare
Journal: arXiv
Published Date: Jan 27, 2025
Abstract
Large language models (LLMs) are increasingly utilized in healthcare
applications. However, their deployment in clinical practice raises significant
safety concerns, including the potential spread of harmful information. This
study systematically assesses the vulnerabilities of seven LLMs to three
advanced black-box jailbreaking techniques within medical contexts. To quantify
the effectiveness of these techniques, we propose an automated and
domain-adapted agentic evaluation pipeline. Experimental results indicate that
leading commercial and open-source LLMs are highly vulnerable to medical
jailbreaking attacks. To bolster model safety and reliability, we further
investigate the effectiveness of Continual Fine-Tuning (CFT) in defending
against medical adversarial attacks. Our findings underscore the need for
continued evaluation of evolving attack methods, domain-specific safety
alignment, and a balance between LLM safety and utility. This research offers
actionable insights for
advancing the safety and reliability of AI clinicians, contributing to ethical
and effective AI deployment in healthcare.