Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models
Journal: arXiv
Published Date: Jan 30, 2025
Abstract
Large Vision-Language Models (VLMs) have achieved remarkable performance
across a wide range of tasks. However, their deployment in safety-critical
domains poses significant challenges. Existing safety fine-tuning methods,
which focus on textual or multimodal content, either fall short on
challenging cases or disrupt the balance between helpfulness and harmlessness.
Our evaluation highlights a safety reasoning gap: these methods lack
safety-related visual reasoning ability, which leads to such bottlenecks. To address this
limitation and enhance both visual perception and reasoning in safety-critical
contexts, we propose a novel dataset that integrates multi-image inputs with
safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve
model performance. Specifically, we introduce the Multi-Image Safety (MIS)
dataset, an instruction-following dataset tailored for multi-image safety
scenarios, consisting of training and test splits. Our experiments demonstrate
that InternVL2.5-8B fine-tuned with MIS significantly outperforms both
powerful open-source models and API-based models in challenging multi-image
tasks requiring safety-related visual reasoning. This approach not only
delivers exceptional safety performance but also preserves general capabilities
without any trade-offs. In particular, fine-tuning with MIS increases average
accuracy by 0.83% across five general benchmarks and reduces the Attack Success
Rate (ASR) on multiple safety benchmarks by a large margin.
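
The two headline metrics can be made concrete with a minimal sketch. It assumes the common definition of ASR as the fraction of adversarial prompts whose response is judged harmful, and it assumes the accuracy gain is a plain mean of per-benchmark differences. The abstract does not state either computation, so both are assumptions. The function names and inputs are hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence


def attack_success_rate(is_harmful: Sequence[bool]) -> float:
    """Fraction of adversarial prompts whose response was judged harmful.

    Assumed definition: ASR = (# harmful responses) / (# attack prompts).
    How each response is judged (human or model-based) is outside this sketch.
    """
    if not is_harmful:
        raise ValueError("need at least one judged response")
    return sum(is_harmful) / len(is_harmful)


def mean_accuracy_change(before: Sequence[float], after: Sequence[float]) -> float:
    """Mean per-benchmark accuracy change (after minus before).

    Assumes one accuracy value per benchmark, in the same order and the
    same units (for example, percent) in both sequences.
    """
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(a - b for a, b in zip(after, before)) / len(before)


if __name__ == "__main__":
    # Made-up judgments and scores for illustration only; not results from the paper.
    judgments = [True, False, False, True]
    print(attack_success_rate(judgments))
    print(mean_accuracy_change([50.0, 60.0], [51.0, 61.0]))
```

Under this definition a lower ASR is better, so a safety fine-tuned model should be judged harmful on fewer attack prompts than its base model.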