CAD: A General Multimodal Framework for Video Deepfake Detection via Cross-Modal Alignment and Distillation
Journal: arXiv
Published Date: May 21, 2025
Abstract
The rapid emergence of multimodal deepfakes, in which visual and auditory content
are manipulated in concert, undermines the reliability of existing detectors that
rely solely on modality-specific artifacts or cross-modal inconsistencies. In
this work, we first demonstrate that modality-specific forensic traces (e.g.,
face-swap artifacts or spectral distortions) and modality-shared semantic
misalignments (e.g., lip-speech asynchrony) offer complementary evidence, and
that neglecting either aspect limits detection performance. Existing approaches
either naively fuse modality-specific features without reconciling their
conflicting characteristics or focus predominantly on semantic misalignment at
the expense of modality-specific fine-grained artifact cues. To address these
shortcomings, we propose a general multimodal framework for video deepfake
detection via Cross-Modal Alignment and Distillation (CAD). CAD comprises two
core components: 1) Cross-modal alignment that identifies inconsistencies in
high-level semantic synchronization (e.g., lip-speech mismatches); 2)
Cross-modal distillation that mitigates feature conflicts during fusion while
preserving modality-specific forensic traces (e.g., spectral distortions in
synthetic audio). Extensive experiments on both multimodal and unimodal (e.g.,
image-only/video-only) deepfake benchmarks demonstrate that CAD significantly
outperforms previous methods, validating the necessity of harmonious
integration of multimodal complementary information.
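
To make the two components concrete, the following is a minimal Python sketch under stated assumptions, not the method as implemented in the paper. It assumes that each modality has already been encoded into per-time-step embedding vectors, that semantic synchronization can be approximated by the mean cosine similarity between time-aligned visual and audio embeddings, and that distillation can be approximated by a mean-squared-error penalty pulling a fused (student) feature toward a unimodal (teacher) feature. The names `alignment_score` and `distillation_loss`, and the choice of cosine similarity and MSE, are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_score(visual: Sequence[Vector], audio: Sequence[Vector]) -> float:
    """Assumed proxy for cross-modal alignment: mean cosine similarity between
    time-aligned visual and audio embeddings. A low score hints at a
    lip-speech mismatch."""
    if len(visual) != len(audio) or len(visual) == 0:
        raise ValueError("visual and audio must be non-empty and equally long")
    total = sum(cosine_similarity(v, a) for v, a in zip(visual, audio))
    return total / len(visual)


def distillation_loss(student: Vector, teacher: Vector) -> float:
    """Assumed proxy for cross-modal distillation: mean squared error between
    a fused (student) feature and a unimodal (teacher) feature, so that fusion
    does not discard modality-specific traces."""
    if len(student) != len(teacher) or len(student) == 0:
        raise ValueError("student and teacher must be non-empty and equally long")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


if __name__ == "__main__":
    # Toy embeddings: two time steps, two dimensions (illustrative values only).
    visual = [[1.0, 0.0], [0.0, 1.0]]
    audio_in_sync = [[1.0, 0.0], [0.0, 1.0]]
    audio_shifted = [[0.0, 1.0], [1.0, 0.0]]
    print("in-sync score:", alignment_score(visual, audio_in_sync))
    print("shifted score:", alignment_score(visual, audio_shifted))
    print("distillation loss:", distillation_loss([1.0, 2.0], [1.0, 4.0]))
```

In this toy setting, the in-sync pair scores higher than the shifted pair, which is the kind of signal an alignment component could exploit, while the distillation term shrinks as the fused feature stays close to the unimodal one. A real detector would learn the encoders and weight the two objectives; those details are not specified in the abstract.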