AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization
Journal:
arXiv
Published Date:
Apr 2, 2025
Abstract
Large Vision-Language Models (LVLMs), such as GPT-4o and LLaVA, have recently
made remarkable advances and are increasingly being deployed in
real-world applications. However, inheriting the sensitivity of visual neural
networks, LVLMs remain vulnerable to adversarial attacks, which can result in
erroneous or malicious outputs. While existing efforts utilize adversarial
fine-tuning to enhance robustness, they often suffer from performance
degradation on clean inputs. In this paper, we propose AdPO, a novel
adversarial defense strategy for LVLMs based on preference optimization. For
the first time, we reframe adversarial training as a preference optimization
problem, aiming to enhance the model's preference for generating normal outputs
on clean inputs while rejecting potentially misleading outputs for
adversarial examples. Notably, AdPO achieves this by solely modifying the image
encoder, e.g., CLIP ViT, resulting in superior clean and adversarial
performance on a variety of downstream tasks. Because training involves
large language models (LLMs), the computational cost increases significantly.
We validate that training on smaller LVLMs and subsequently transferring to
larger models can achieve competitive performance while maintaining efficiency
comparable to baseline methods. Our comprehensive experiments confirm the
effectiveness of the proposed AdPO, which provides a novel perspective for
future adversarial defense research.
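
The abstract does not state the exact training objective, so the following is only a minimal sketch of what a preference-optimization loss of this kind could look like. It assumes the standard DPO (Direct Preference Optimization) form, with the "chosen" response being the normal output on a clean image and the "rejected" response being a misleading output on an adversarial image. The function names, the `beta` value, and the use of summed sequence log-probabilities as inputs are illustrative assumptions, not details taken from the paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed temperature; not specified in the abstract
) -> float:
    """DPO-style loss for one (chosen, rejected) pair.

    Assumption: "chosen" is the normal response on the clean image and
    "rejected" is the misleading response on the adversarial image.
    Inputs are sequence log-probabilities under the model being trained
    (policy) and under a frozen reference copy (ref).
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the policy prefers chosen over rejected.
    return -log_sigmoid(beta * (chosen_gain - rejected_gain))


if __name__ == "__main__":
    # Policy identical to the reference: zero margin, loss equals log(2).
    print(preference_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy favors the chosen response: the loss is lower.
    print(preference_loss(-2.0, -8.0, -5.0, -5.0))
```

In a setup like the one the abstract describes, only the image encoder's parameters would receive gradients from such a loss, while the language model stays fixed; that wiring is omitted here because it depends on implementation details the abstract does not give.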