Zero-Shot Defense Against Toxic Images via Inherent Multimodal Alignment in LVLMs
Journal: arXiv
Published Date: Feb 25, 2025
Abstract
Large Vision-Language Models (LVLMs) have made significant strides in
multimodal comprehension, thanks to extensive pre-training and fine-tuning on
large-scale visual datasets. However, despite their robust textual safety
mechanisms, they remain vulnerable to harmful visual inputs. Existing
safeguards, which typically rely on pre-filtering or fine-tuning, incur high costs
and diminish overall utility. To address this critical vulnerability, we
introduce SafeCLIP, a lightweight method that leverages LVLMs' inherent
multimodal alignment for zero-shot toxic image detection. By projecting CLIP's
discarded CLS token into its text space and matching it with toxic descriptors,
SafeCLIP detects harmful content without any architectural changes, adding
minimal latency and enabling dynamic safety corrections during inference and
fine-tuning. Experiments show that SafeCLIP achieves a 66.9% defense success
rate with only a 3.2% false positive rate and 7.2% overhead. In contrast,
state-of-the-art methods achieve 52.9% success but have a 10.7% false positive
rate and 210% overhead. Our work demonstrates that leveraging inherent
multimodal alignment can yield efficient, low-cost LVLM safety. Code is
available at anonymous.4open.science/r/safeclip-2C01.
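
The detection step described in the abstract (project the vision encoder's CLS token into the text space, then match it against toxic descriptors) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `toxicity_score` and `is_toxic`, the max-cosine-similarity decision rule, the 0.5 threshold, and the toy embeddings are all assumptions made for illustration. In a real pipeline the CLS token, the projection matrix, and the descriptor embeddings would presumably come from the LVLM's CLIP vision and text encoders.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def toxicity_score(cls_token, projection, toxic_text_emb):
    """Highest cosine similarity between the projected CLS token
    and any toxic descriptor embedding (hypothetical scoring rule)."""
    image_emb = l2_normalize(cls_token @ projection)  # shape (d_text,)
    text_emb = l2_normalize(toxic_text_emb)           # shape (n, d_text)
    return float(np.max(text_emb @ image_emb))

def is_toxic(cls_token, projection, toxic_text_emb, threshold=0.5):
    """Flag the image when the score reaches the threshold (assumed value)."""
    return toxicity_score(cls_token, projection, toxic_text_emb) >= threshold

# Toy stand-ins: a 6-dim "vision" space mapped to a 4-dim "text" space.
projection = np.eye(6, 4)                      # placeholder for the visual projection
toxic_text_emb = np.array([[1.0, 0.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0, 0.0]])  # placeholder descriptor embeddings

flagged_cls = np.array([0.9, 0.1, 0.0, 0.0, 5.0, 5.0])  # aligns with descriptor 0
clean_cls = np.array([0.0, 0.0, 1.0, 0.2, 0.0, 0.0])    # orthogonal to both descriptors

print(is_toxic(flagged_cls, projection, toxic_text_emb))
print(is_toxic(clean_cls, projection, toxic_text_emb))
```

Because the score is a single matrix product over embeddings the model already computes, a check of this shape adds little inference cost, which is consistent with the low overhead the abstract reports.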