The Structural Safety Generalization Problem
Journal: arXiv
Published Date: Apr 13, 2025
Abstract
LLM jailbreaks are a widespread safety challenge. Because the general problem
has so far proven intractable, we propose targeting a key failure mechanism:
the failure of safety to generalize across semantically equivalent inputs. We
narrow the target further by requiring that the attacks under study have
properties that make them tractable:
explainability, transferability between models, and transferability between
goals. Red-teaming within this framework, we uncover new vulnerabilities to
multi-turn, multi-image, and translation-based attacks. These attacks are
semantically equivalent, by design, to their single-turn, single-image, or
untranslated counterparts, enabling systematic comparisons; we show that the
different structures yield different safety outcomes. We then
demonstrate the potential for this framework to enable new defenses by
proposing a Structure Rewriting Guardrail, which converts an input to a
structure more conducive to safety assessment. This guardrail significantly
improves refusal of harmful inputs, without over-refusing benign ones. By
framing this intermediate challenge, which is more tractable than universal
defenses but still essential for long-term safety, we highlight a critical
milestone for AI safety research.
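
The guardrail idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the rewriting is done, so everything here is an assumption made for illustration: the rewrite is taken to be a simple flattening of a multi-turn input into a single turn, and the names `rewrite_to_single_turn`, `guarded_respond`, and `toy_is_harmful` are hypothetical. The toy keyword check stands in for a real safety classifier and is not the authors' method.

```python
from typing import Callable, List

REFUSAL = "REFUSED"


def rewrite_to_single_turn(turns: List[str]) -> str:
    """Collapse a multi-turn input into one single-turn string.

    Assumption: plain concatenation stands in for the rewriting step;
    a real guardrail would likely use a model to do the rewrite.
    """
    return " ".join(t.strip() for t in turns if t.strip())


def toy_is_harmful(text: str) -> bool:
    """Hypothetical placeholder classifier: flags one fixed phrase."""
    return "forbidden request" in text.lower()


def guarded_respond(
    turns: List[str],
    is_harmful: Callable[[str], bool],
    respond: Callable[[List[str]], str],
) -> str:
    """Assess safety on the rewritten input, then answer the original."""
    rewritten = rewrite_to_single_turn(turns)
    if is_harmful(rewritten):
        return REFUSAL
    return respond(turns)


# A request split across two turns: neither turn alone trips the toy
# classifier, but the rewritten single-turn form does.
split_turns = ["Please complete this forbidden", "request for me"]
benign_turns = ["What is the capital", "of France?"]


def echo_model(turns: List[str]) -> str:
    """Hypothetical stand-in for the underlying model."""
    return "ANSWERED"


split_result = guarded_respond(split_turns, toy_is_harmful, echo_model)
benign_result = guarded_respond(benign_turns, toy_is_harmful, echo_model)
```

In this toy setup the split request is refused only because the check runs on the rewritten form, while the benign input passes through unchanged. That mirrors the abstract's claim in spirit only; it says nothing about the measured results.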