Automated violence monitoring system for real-time fistfight detection using deep learning-based temporal action localization.
Journal:
Scientific Reports
Published Date:
Aug 12, 2025
Abstract
Fistfight detection in video data is a critical task in video surveillance, where identifying physical altercations in real time can enhance safety and security in public spaces. Earlier techniques primarily emphasized capturing inter-person interactions and combining individual characteristics into group-based representations, often overlooking the critical intra-person dynamics within the human bodypose-point framework. However, essential individual features can be extracted by examining the progression and temporal patterns of human skeletal movements. This paper presents a novel multimodal spatio-temporal fistfight detection model (MSTFDet) that integrates RGB images and human skeletal data to identify violent behavior accurately. The proposed framework leverages both a Context-Aware Encoded Transformer (CAET) for modeling interactions between individuals and their environment and a Spatial-Temporal Graph Convolutional Network (ST-GCN) for capturing intra-person and inter-person dynamics from skeletal data. The RGB module uses a combination of spatial and temporal transformers to model contextual relationships and individual actions, while the bodypose-point module processes skeletal data to capture the fine-grained motion of individuals. We conduct evaluations on two public datasets featuring complex real-world scenarios: the Surveillance Camera Fight Dataset (SCFD) and the RWF-2000 dataset. MSTFDet achieved a multi-class classification accuracy (MCA) of 92.3% on SCFD and 95.2% on RWF-2000. These results highlight the effectiveness of the proposed approach in capturing both spatial and temporal features, providing a robust solution for real-time fistfight detection in diverse and challenging environments.
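To make the two-branch design concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: a temporal transformer over per-frame RGB features stands in for the CAET branch, a simplified graph convolution over joint coordinates stands in for the ST-GCN branch, and the two representations are fused by concatenation for a binary fight / no-fight classifier. All module names, dimensions, and the placeholder adjacency matrix are illustrative assumptions.

```python
# Hypothetical two-branch fusion sketch (not the paper's code):
# RGB branch -> temporal transformer; skeleton branch -> graph conv; late fusion.
import torch
import torch.nn as nn


class SkeletonGraphConv(nn.Module):
    """One simplified spatial graph convolution per frame: X' = relu(A X W)."""

    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)          # (J, J) normalized adjacency
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x):                             # x: (B, T, J, C)
        x = torch.einsum("jk,btkc->btjc", self.A, x)  # aggregate neighboring joints
        return torch.relu(self.lin(x))


class MSTFDetSketch(nn.Module):
    def __init__(self, frame_feat_dim=512, joints=17, joint_dim=2, d_model=256):
        super().__init__()
        # RGB branch: project per-frame features, then a temporal transformer encoder.
        self.rgb_proj = nn.Linear(frame_feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal_tf = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Skeleton branch: stacked graph convolutions over joints.
        A = torch.eye(joints)                          # placeholder adjacency (self-loops only)
        self.gcn1 = SkeletonGraphConv(joint_dim, 64, A)
        self.gcn2 = SkeletonGraphConv(64, d_model, A)
        # Late fusion + binary classifier (fight / no fight).
        self.head = nn.Linear(2 * d_model, 2)

    def forward(self, rgb_feats, skeletons):
        # rgb_feats: (B, T, frame_feat_dim) per-frame CNN features
        # skeletons: (B, T, J, joint_dim) 2D joint coordinates
        r = self.temporal_tf(self.rgb_proj(rgb_feats)).mean(dim=1)    # (B, d_model)
        s = self.gcn2(self.gcn1(skeletons)).mean(dim=(1, 2))          # (B, d_model)
        return self.head(torch.cat([r, s], dim=-1))                   # (B, 2) logits


if __name__ == "__main__":
    model = MSTFDetSketch()
    logits = model(torch.randn(2, 16, 512), torch.randn(2, 16, 17, 2))
    print(logits.shape)  # torch.Size([2, 2])
```

In this sketch the skeleton adjacency is an identity matrix purely as a placeholder; a real ST-GCN would use the human-joint connectivity graph, and the RGB branch would also include a spatial transformer over region features as described in the abstract.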