Multimodal Conditional Information Bottleneck for Generalizable AI-Generated Image Detection
Journal:
arXiv
Published Date:
May 21, 2025
Abstract
Although existing CLIP-based methods for detecting AI-generated images have
achieved promising results, they are still limited by severe feature
redundancy, which hinders their generalization ability. A straightforward
remedy is to incorporate an information bottleneck network into the task.
However, relying solely on the prompts that correspond to each image yields
suboptimal performance because such prompts are inherently diverse.
In this paper, we propose a multimodal conditional bottleneck network
to reduce feature redundancy while enhancing the discriminative power of
features extracted by CLIP, thereby improving the model's generalization
ability. We begin with a semantic analysis experiment, where we observe that
arbitrary text features exhibit lower cosine similarity with real image
features than with fake image features in the CLIP feature space, a phenomenon
we refer to as "bias". Motivated by this observation, we introduce InfoFD, a text-guided
AI-generated image detection framework. InfoFD consists of two key components:
the Text-Guided Conditional Information Bottleneck (TGCIB) and Dynamic Text
Orthogonalization (DTO). TGCIB improves the generalizability of learned
representations by conditioning on both text and class modalities. DTO
dynamically updates weighted text features, preserving semantic information
while leveraging the global "bias". Our model achieves exceptional
generalization performance on the GenImage dataset and on images produced by
the latest generative models. Our code is available at
https://github.com/Ant0ny44/InfoFD.
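
The "bias" measurement described in the abstract can be illustrated with a minimal sketch. This is not taken from the InfoFD repository: it assumes that text and image features have already been extracted, and it substitutes synthetic random vectors for real CLIP embeddings. The function names (`mean_cosine_similarity`, `similarity_bias`) and the toy data are hypothetical and serve only to show how the gap in average cosine similarity could be computed.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def mean_cosine_similarity(text_feats, image_feats):
    """Average cosine similarity over all (image, text) pairs."""
    t = l2_normalize(text_feats)
    v = l2_normalize(image_feats)
    return float((v @ t.T).mean())

def similarity_bias(text_feats, real_feats, fake_feats):
    """Gap between text-to-fake and text-to-real average similarity.

    A positive value means the text features sit closer to the fake
    images than to the real ones, which is the direction of the "bias"
    reported in the abstract.
    """
    return (mean_cosine_similarity(text_feats, fake_feats)
            - mean_cosine_similarity(text_feats, real_feats))

# Synthetic stand-ins for CLIP features (assumption: 64-dim toy vectors).
rng = np.random.default_rng(0)
dim = 64
shared = np.zeros(dim)
shared[0] = 5.0  # direction shared by the toy text and fake features

text_feats = shared + rng.normal(size=(8, dim))    # arbitrary prompts
real_feats = rng.normal(size=(100, dim))           # no shared direction
fake_feats = shared + rng.normal(size=(100, dim))  # shares the direction

bias = similarity_bias(text_feats, real_feats, fake_feats)
print(f"similarity gap (fake minus real): {bias:.3f}")
```

With real data, the three arrays would instead hold encoder outputs for a set of arbitrary prompts, real images, and generated images. The toy data is built so that the gap is positive, so the printed value says nothing about the size of the effect in the paper.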