Multi-Modal Face Anti-Spoofing via Cross-Modal Feature Transitions
Journal: arXiv
Published Date: Jul 8, 2025
Abstract
Multi-modal face anti-spoofing (FAS) aims to detect genuine human presence by
extracting discriminative liveness cues from multiple modalities, such as RGB,
infrared (IR), and depth images, to enhance the robustness of biometric
authentication systems. However, because data from different modalities are
typically captured by different camera sensors and under diverse environmental
conditions, multi-modal FAS often exhibits significantly greater distribution
discrepancies between training and testing domains than single-modal FAS.
Furthermore, during the inference stage, multi-modal FAS confronts even greater
challenges when one or more modalities are unavailable or inaccessible. In this
paper, we propose a novel Cross-modal Transition-guided Network (CTNet) to
tackle the challenges in the multi-modal FAS task. Our motivation is the
observation that, within a single modality, the visual differences among live
faces are typically much smaller than those among spoof faces. Additionally,
feature transitions across modalities are more consistent within the live class
than between the live and spoof classes. Building on this insight, we first propose
learning consistent cross-modal feature transitions among live samples to
construct a generalized feature space. Next, we learn the inconsistent
cross-modal feature transitions between live and spoof samples to
effectively detect out-of-distribution (OOD) attacks during inference. To
further address the issue of missing modalities, we propose learning
complementary IR and depth features from the RGB modality to serve as
auxiliary modalities. Extensive experiments demonstrate that the proposed CTNet
outperforms previous two-class multi-modal FAS methods across most protocols.
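
The two transition objectives described above can be illustrated with a minimal sketch. This is not the CTNet implementation. It assumes, purely for illustration, that a cross-modal feature transition is the difference between two modality feature vectors and that consistency is measured by cosine similarity. The function names (transitions, live_consistency_loss, live_spoof_inconsistency_loss) and the margin parameter are hypothetical and do not come from the paper.

```python
import numpy as np


def transitions(feat_a, feat_b):
    """Cross-modal transition per sample (assumption: a plain feature difference)."""
    return feat_b - feat_a


def _unit(x, eps=1e-8):
    """L2-normalise each row so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def live_consistency_loss(t_live):
    """Mean (1 - cosine similarity) over all distinct pairs of live transitions.

    Small when live samples share a common transition direction.
    Requires at least two live samples.
    """
    u = _unit(t_live)
    sim = u @ u.T
    off_diagonal = sim[~np.eye(len(u), dtype=bool)]
    return float(np.mean(1.0 - off_diagonal))


def live_spoof_inconsistency_loss(t_live, t_spoof, margin=0.0):
    """Hinge penalty on live-vs-spoof transition similarity above `margin`.

    Zero when every spoof transition is at most `margin`-similar to every
    live transition (hypothetical formulation, not the paper's loss).
    """
    sim = _unit(t_live) @ _unit(t_spoof).T
    return float(np.mean(np.maximum(sim - margin, 0.0)))


# Toy usage with random stand-ins for RGB and IR embeddings.
rng = np.random.default_rng(0)
rgb_live = rng.normal(size=(4, 8))
ir_live = rgb_live + np.ones(8)            # live: shared transition direction
rgb_spoof = rng.normal(size=(4, 8))
ir_spoof = rgb_spoof + rng.normal(size=(4, 8))  # spoof: arbitrary transitions

t_live = transitions(rgb_live, ir_live)
t_spoof = transitions(rgb_spoof, ir_spoof)
print(live_consistency_loss(t_live))
print(live_spoof_inconsistency_loss(t_live, t_spoof))
```

In a trainable setting, both terms would be computed on learned embeddings and combined with a classification loss; how CTNet actually defines and weights these objectives is specified in the paper body, not in this abstract.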