Automating Safety Enhancement for LLM-based Agents with Synthetic Risk Scenarios
Journal: arXiv
Published Date: May 23, 2025
Abstract
Large Language Model (LLM)-based agents are increasingly deployed in
real-world applications such as digital assistants, autonomous customer
service, and decision-support systems, where their ability to interact in
multi-turn, tool-augmented environments makes them indispensable. However,
ensuring the safety of these agents remains a significant challenge due to the
diverse and complex risks arising from dynamic user interactions, external tool
usage, and the potential for unintended harmful behaviors. To address this
critical issue, we propose AutoSafe, the first framework that systematically
enhances agent safety through fully automated synthetic data generation.
Concretely, 1) we introduce an open and extensible threat model, OTS, which
formalizes how unsafe behaviors emerge from the interplay of user instructions,
interaction contexts, and agent actions. This enables precise modeling of
safety risks across diverse scenarios. 2) We develop a fully automated data
generation pipeline that simulates unsafe user behaviors, applies
self-reflective reasoning to generate safe responses, and constructs a
large-scale, diverse, and high-quality safety training dataset, eliminating the
need for hazardous real-world data collection. To evaluate the effectiveness of
our framework, we design comprehensive experiments on both synthetic and
real-world safety benchmarks. Results demonstrate that AutoSafe boosts safety
scores by 45% on average and achieves a 28.91% improvement on real-world tasks,
validating the generalization ability of our learned safety strategies. These
results highlight the practical advancement and scalability of AutoSafe in
building safer LLM-based agents for real-world deployment. We have released the
project page at https://auto-safe.github.io/.
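
The two components described above can be illustrated with a minimal sketch. It is not the AutoSafe implementation: the names RiskScenario, SafetyExample, build_safety_dataset, is_unsafe, and reflect are hypothetical, and the judge and reflection steps are passed in as plain callables standing in for whatever models the real pipeline uses. The sketch assumes only what the abstract states: a risk is described by a user instruction, an interaction context, and an agent action, and unsafe cases are turned into training data by generating a safe response through reflection.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class RiskScenario:
    """Hypothetical record for one (instruction, context, action) triple."""
    instruction: str          # user instruction
    context: Tuple[str, ...]  # earlier interaction steps, e.g. tool calls
    action: str               # agent action proposed in this context


@dataclass(frozen=True)
class SafetyExample:
    """Hypothetical training pair: an unsafe scenario and a safe replacement."""
    scenario: RiskScenario
    safe_action: str


def build_safety_dataset(
    scenarios: Iterable[RiskScenario],
    is_unsafe: Callable[[RiskScenario], bool],
    reflect: Callable[[RiskScenario], str],
) -> List[SafetyExample]:
    """Collect safe responses for scenarios whose proposed action is unsafe.

    is_unsafe and reflect are assumed stand-ins for a safety judge and a
    self-reflection step; both are supplied by the caller.
    """
    dataset: List[SafetyExample] = []
    for scenario in scenarios:
        if not is_unsafe(scenario):
            continue  # only unsafe behaviors yield safety training data
        revised = reflect(scenario)
        # Re-check the revised action in the same instruction and context.
        candidate = RiskScenario(scenario.instruction, scenario.context, revised)
        if not is_unsafe(candidate):
            dataset.append(SafetyExample(scenario, revised))
    return dataset
```

Re-checking the revised action before keeping it is a design choice of this sketch, so that a failed reflection does not enter the dataset; the abstract does not say whether the actual pipeline filters in this way.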