Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation
Journal:
arXiv
Published Date:
Jun 26, 2025
Abstract
Given the growing influence of language model-based agents on high-stakes
societal decisions, from public policy to healthcare, ensuring their beneficial
impact requires understanding the far-reaching implications of their
suggestions. We propose a proof-of-concept framework that projects how
model-generated advice could propagate through societal systems on a
macroscopic scale over time, enabling more robust alignment. To assess the
long-term safety awareness of language models, we also introduce a dataset of
100 indirect harm scenarios, testing models' ability to foresee adverse,
non-obvious outcomes from seemingly harmless user prompts. Our approach
achieves not only over 20% improvement on the new dataset but also an average
win rate exceeding 70% against strong baselines on existing safety benchmarks
(AdvBench, SafeRLHF, WildGuardMix), suggesting a promising direction for safer
agents.