Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights
Journal:
arXiv
Published Date:
Jun 6, 2025
Abstract
The application scope of Large Language Models (LLMs) continues to expand,
leading to increasing interest in personalized LLMs that align with human
values. However, aligning these models with individual values raises
significant safety concerns, as certain values may correlate with harmful
information. In this paper, we identify specific safety risks associated with
value-aligned LLMs and investigate the psychological principles behind these
challenges. Our findings reveal two key insights. (1) Value-aligned LLMs are
more prone to harmful behavior compared to non-fine-tuned models and exhibit
slightly higher risks in traditional safety evaluations than other fine-tuned
models. (2) These safety issues arise because value-aligned LLMs genuinely
generate text according to the aligned values, which can amplify harmful
outcomes. Using a dataset with detailed safety categories, we find significant
correlations between value alignment and safety risks, supported by
psychological hypotheses. This study offers insights into the "black box" of
value alignment and proposes in-context alignment methods to enhance the safety
of value-aligned LLMs.