Distorting Embedding Space for Safety: A Defense Mechanism for Adversarially Robust Diffusion Models
Journal: arXiv
Published Date: Jan 31, 2025
Abstract
Text-to-image diffusion models show remarkable generation performance
following text prompts, but risk generating Not Safe For Work (NSFW) content
from unsafe prompts. Existing approaches, such as prompt filtering or concept
unlearning, fail to defend against adversarial attacks while maintaining benign
image quality. In this paper, we propose a novel approach called Distorting
Embedding Space (DES), a text encoder-based defense mechanism that effectively
tackles these issues through innovative embedding space control. DES transforms
unsafe embeddings, extracted from a text encoder using unsafe prompts, toward
carefully calculated safe embedding regions to prevent unsafe content
generation, while reproducing the original safe embeddings. DES also
neutralizes the nudity embedding, extracted using the prompt "nudity", by aligning
it with neutral embedding to enhance robustness against adversarial attacks.
These methods ensure both robust defense and high-quality image generation.
Additionally, DES can be adopted in a plug-and-play manner and adds no
inference overhead, facilitating its deployment. Extensive experiments on
diverse attack types, including black-box and white-box scenarios, demonstrate
DES's state-of-the-art performance in both defense capability and benign image
generation quality. Our model is available at https://github.com/aei13/DES.
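
The three objectives named in the abstract (moving unsafe embeddings toward safe targets, reproducing the original embeddings of safe prompts, and aligning the "nudity" embedding with a neutral one) can be pictured as a combined training loss on the text encoder. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `des_style_loss`, `encode`, and `encode_frozen` are hypothetical, mean squared error and equal term weights are assumed, the safe target for each unsafe prompt is assumed to be given, and the empty prompt is assumed to serve as the neutral prompt.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Encoder = Callable[[str], Vector]  # hypothetical: maps a prompt to its embedding


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def des_style_loss(
    encode: Encoder,                    # encoder being tuned (assumed)
    encode_frozen: Encoder,             # frozen copy of the original encoder (assumed)
    unsafe_to_safe: Dict[str, Vector],  # unsafe prompt -> precomputed safe target (assumed given)
    safe_prompts: List[str],
    neutral_prompt: str = "",           # assumption: empty prompt acts as "neutral"
    nudity_prompt: str = "nudity",      # prompt named in the abstract
) -> float:
    # 1) Pull embeddings of unsafe prompts toward their safe targets.
    unsafe_term = sum(mse(encode(p), t) for p, t in unsafe_to_safe.items())
    # 2) Keep embeddings of safe prompts equal to the original encoder's output.
    preserve_term = sum(mse(encode(p), encode_frozen(p)) for p in safe_prompts)
    # 3) Align the "nudity" embedding with the neutral embedding.
    neutral_term = mse(encode(nudity_prompt), encode_frozen(neutral_prompt))
    # Equal weighting of the three terms is an assumption of this sketch.
    return unsafe_term + preserve_term + neutral_term
```

Because only the text encoder changes in this picture, a tuned encoder could replace the original one without touching the diffusion backbone, which is consistent with the plug-and-play and no-inference-overhead claims above. How the safe target regions are actually calculated is not specified in the abstract and is left as an input here.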