Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models
Journal:
arXiv
Published Date:
May 23, 2025
Abstract
In the era of rapid generative AI development, interactions between humans
and large language models face significant misusing risks. Previous research
has primarily focused on black-box scenarios using human-guided prompts and
white-box scenarios leveraging gradient-based LLM generation methods,
neglecting the possibility that LLMs can act not only as victim models, but
also as attacker models to harm other models. We proposes a novel jailbreaking
method inspired by the Chain-of-Thought mechanism, where the attacker model
uses mission transfer to conceal harmful user intent in dialogue and generates
chained narrative lures to stimulate the reasoning capabilities of victim
models, leading to successful jailbreaking. To enhance the attack success rate,
we introduce a helper model that performs random narrative optimization on the
narrative lures during multi-turn dialogues while ensuring alignment with the
original intent, enabling the optimized lures to bypass the safety barriers of
victim models effectively. Our experiments reveal that models with weaker
safety mechanisms exhibit stronger attack capabilities, demonstrating that
models can not only be exploited, but also help harm others. By incorporating
toxicity scores, we employ third-party models to evaluate the harmfulness of
victim models' responses to jailbreaking attempts. The study shows that using
refusal keywords as an evaluation metric for attack success rates is
significantly flawed because it does not assess whether the responses guide
harmful questions, while toxicity scores measure the harm of generated content
with more precision and its alignment with harmful questions. Our approach
demonstrates outstanding performance, uncovering latent vulnerabilities in LLMs
and providing data-driven feedback to optimize LLM safety mechanisms. We also
discuss two defensive strategies to offer guidance on improving defense
mechanisms.