FaceShield: Explainable Face Anti-Spoofing with Multimodal Large Language Models
Journal: arXiv
Published Date: May 14, 2025
Abstract
Face anti-spoofing (FAS) is crucial for protecting facial recognition systems
from presentation attacks. Previous methods approached this task as a
classification problem, lacking interpretability and reasoning behind the
predicted results. Recently, multimodal large language models (MLLMs) have
shown strong capabilities in perception, reasoning, and decision-making in
visual tasks. However, there is currently no universal and comprehensive MLLM
and dataset specifically designed for the FAS task. To address this gap, we propose
FaceShield, an MLLM for FAS, along with the corresponding pre-training and
supervised fine-tuning (SFT) datasets, FaceShield-pre10K and FaceShield-sft45K.
FaceShield is capable of determining the authenticity of faces, identifying
types of spoofing attacks, providing reasoning for its judgments, and detecting
attack areas. Specifically, we employ spoof-aware vision perception (SAVP) that
incorporates both the original image and auxiliary information based on prior
knowledge. We then use a prompt-guided vision token masking (PVTM) strategy to
randomly mask vision tokens, thereby improving the model's generalization
ability. We conducted extensive experiments on three benchmark datasets,
demonstrating that FaceShield significantly outperforms previous deep learning
models and general MLLMs on four FAS tasks, i.e., coarse-grained
classification, fine-grained classification, reasoning, and attack
localization. Our instruction datasets, protocols, and code will be released
soon.
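
The abstract does not spell out how PVTM selects the tokens it masks, so the following is only a rough sketch of one way prompt-guided random masking of vision tokens could look, not the authors' implementation. It assumes that each vision token and the prompt are plain embedding vectors, that relevance is measured by cosine similarity, and that tokens less similar to the prompt are more likely to be dropped. The names prompt_guided_mask, mask_ratio, and rng are hypothetical and chosen for illustration.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prompt_guided_mask(vision_tokens, prompt_embedding, mask_ratio=0.3, rng=None):
    """Randomly drop a fraction of vision tokens, favoring prompt-irrelevant ones.

    ASSUMPTION: the weighting scheme below is illustrative; the paper's actual
    PVTM rule is not described in the abstract.
    Returns (kept_tokens, kept_indices) with indices in ascending order.
    """
    rng = rng or random.Random()
    n = len(vision_tokens)
    num_mask = int(n * mask_ratio)

    # Relevance of each vision token to the prompt, in [-1, 1].
    scores = [cosine(tok, prompt_embedding) for tok in vision_tokens]
    # Masking weight grows as relevance falls; the epsilon keeps weights positive.
    weights = [1.0 - s + 1e-6 for s in scores]

    # Weighted sampling without replacement of the tokens to mask.
    candidates = list(range(n))
    masked = set()
    for _ in range(num_mask):
        pick = rng.choices(candidates, weights=[weights[i] for i in candidates], k=1)[0]
        masked.add(pick)
        candidates.remove(pick)

    kept_indices = [i for i in range(n) if i not in masked]
    return [vision_tokens[i] for i in kept_indices], kept_indices


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    prompt = [1.0, 0.0]
    kept, idx = prompt_guided_mask(tokens, prompt, mask_ratio=0.5, rng=random.Random(0))
    print(idx, kept)
```

In a training loop, such masking would typically be applied only during training, with all vision tokens kept at inference time; that choice is likewise an assumption here rather than something the abstract states.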