Benchmarking Ethical and Safety Risks of Healthcare LLMs in China: Toward Systemic Governance under Healthy China 2030
Journal: arXiv
Published Date: May 12, 2025
Abstract
Large Language Models (LLMs) are poised to transform healthcare under China's
Healthy China 2030 initiative, yet they introduce new ethical and
patient-safety challenges. To quantify these risks, we present a novel
12,000-item Q&A benchmark covering 11 ethics and 9 safety dimensions in
medical contexts. Using this dataset, we assess
state-of-the-art Chinese medical LLMs (e.g., Qwen 2.5-32B, DeepSeek), revealing
moderate baseline performance (accuracy 42.7% for Qwen 2.5-32B) and significant
improvements after fine-tuning on our data (up to 50.8% accuracy). Results show
notable gaps in LLM decision-making on ethics and safety scenarios, reflecting
insufficient institutional oversight. We then identify systemic governance
shortfalls (the lack of fine-grained ethical audit protocols, slow
adaptation by hospital IRBs, and insufficient evaluation tools) that currently
hinder safe LLM deployment. Finally, we propose a practical governance
framework for healthcare institutions (embedding LLM auditing teams, enacting
data ethics guidelines, and implementing safety simulation pipelines) to
proactively manage LLM risks. Our study highlights the urgent need for robust
LLM governance in Chinese healthcare, aligning AI innovation with patient
safety and ethical standards.