NurValues: Real-World Nursing Values Evaluation for Large Language Models in Clinical Context
Journal:
arXiv
Published Date:
May 13, 2025
Abstract
This work introduces the first benchmark for nursing value alignment,
consisting of five core value dimensions distilled from international nursing
codes: Altruism, Human Dignity, Integrity, Justice, and Professionalism. The
benchmark comprises 1,100 real-world nursing behavior instances collected
through a five-month longitudinal field study across three hospitals of varying
tiers. These instances are annotated by five clinical nurses and then augmented
with LLM-generated counterfactuals with reversed ethic polarity. Each original
case is paired with a value-aligned and a value-violating version, resulting in
2,200 labeled instances that constitute the Easy-Level dataset. To increase
adversarial complexity, each instance is further transformed into a
dialogue-based format that embeds contextual cues and subtle misleading
signals, yielding a Hard-Level dataset. We evaluate 23 state-of-the-art (SoTA)
LLMs on their alignment with nursing values. Our findings reveal three key
insights: (1) DeepSeek-V3 achieves the highest performance on the Easy-Level
dataset (94.55), where Claude 3.5 Sonnet outperforms other models on the
Hard-Level dataset (89.43), significantly surpassing the medical LLMs; (2)
Justice is consistently the most difficult nursing value dimension to evaluate;
and (3) in-context learning significantly improves alignment. This work aims to
provide a foundation for value-sensitive LLMs development in clinical settings.
The dataset and the code are available at
https://huggingface.co/datasets/Ben012345/NurValues.