Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models
Journal:
arXiv
Published Date:
May 22, 2025
Abstract
Multimodal large language models (MLLMs) enable powerful cross-modal
reasoning capabilities. However, the expanded input space introduces new attack
surfaces. Previous jailbreak attacks often inject malicious instructions from
text into less aligned modalities, such as vision. As MLLMs increasingly
incorporate cross-modal consistency and alignment mechanisms, such explicit
attacks become easier to detect and block. In this work, we propose a novel
implicit jailbreak framework termed IJA that stealthily embeds malicious
instructions into images via least significant bit steganography and couples
them with seemingly benign, image-related textual prompts. To further enhance
attack effectiveness across diverse MLLMs, we incorporate adversarial suffixes
generated by a surrogate model and introduce a template optimization module
that iteratively refines both the prompt and embedding based on model feedback.
On commercial models like GPT-4o and Gemini-1.5 Pro, our method achieves attack
success rates of over 90% using an average of only 3 queries.