A lightweight convolutional neural network architecture for violence detection in video sequences.
Journal: Scientific Reports
Published Date: Feb 6, 2026
Abstract
The escalation of violent incidents in high-density public environments such as political assemblies, concerts, and sports arenas necessitates the development of computationally efficient and accurate real-time violence detection frameworks. Prompt identification of aggressive events from continuous surveillance video streams is critical for initiating rapid countermeasures. However, the task is inherently complex due to spatiotemporal scene variations, illumination inconsistencies, and the intensive computational cost of processing high-dimensional video data. This study introduces a lightweight deep convolutional neural network (CNN) architecture derived from MobileNetV2, optimized through depthwise separable convolutions and inverted residual bottlenecks to achieve significant parameter reduction without compromising classification efficacy. The proposed framework processes video streams by extracting and preprocessing frames (224 × 224 resolution, normalization, augmentation) to enhance generalization and mitigate overfitting. The model was trained and evaluated on two benchmark datasets: the Real-Life Violence Situations Dataset (RLVSD) and the Hockey Fight Dataset (HFD), encompassing balanced classes of violent and non-violent sequences. Empirical evaluation indicates superior performance, attaining 97% accuracy on RLVSD and 94% on HFD, with corresponding gains in precision, recall, and F1-score compared to conventional CNN architectures. Computational profiling confirms substantial efficiency improvements, enabling inference at real-time frame rates on resource-constrained hardware. The proposed methodology demonstrates that optimized lightweight architectures can deliver high-accuracy violence detection while significantly reducing computational overhead. These characteristics make the approach highly deployable in real-world surveillance systems. Future research will focus on temporal feature integration via 3D CNNs or transformer-based models and cross-domain adaptability to heterogeneous video sources.
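To make the parameter-reduction claim concrete, the following minimal sketch counts the weights in a standard convolution, a depthwise separable convolution, and a MobileNetV2-style inverted residual bottleneck. It is an illustration only, not the authors' implementation: the kernel size, channel widths, and expansion factor below are assumed values chosen for the example, and biases and batch-normalization parameters are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a k x k standard convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise conv (one filter per input channel)
    followed by a 1 x 1 pointwise conv (bias omitted)."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


def inverted_residual_params(k, c_in, c_out, expansion):
    """Weights in an inverted residual bottleneck:
    1 x 1 expansion -> k x k depthwise -> 1 x 1 linear projection."""
    hidden = c_in * expansion
    expand = c_in * hidden
    depthwise = k * k * hidden
    project = hidden * c_out
    return expand + depthwise + project


# Illustrative (assumed) layer shape: 3 x 3 kernel, 32 -> 64 channels.
K, C_IN, C_OUT = 3, 32, 64

standard = standard_conv_params(K, C_IN, C_OUT)
separable = depthwise_separable_params(K, C_IN, C_OUT)
ratio = separable / standard

print(f"standard conv:        {standard} weights")
print(f"depthwise separable:  {separable} weights")
print(f"ratio:                {ratio:.4f}")

# The ratio matches the closed form 1/c_out + 1/k^2.
closed_form = 1 / C_OUT + 1 / (K * K)
print(f"closed form:          {closed_form:.4f}")

# Assumed bottleneck: 32 -> 32 channels with expansion factor 6.
bottleneck = inverted_residual_params(K, 32, 32, expansion=6)
print(f"inverted residual:    {bottleneck} weights")
```

For this assumed layer shape the depthwise separable form needs roughly one eighth of the weights of the standard convolution, which is the kind of saving the abstract refers to. The actual reduction in the proposed network depends on its real layer configuration, which the abstract does not specify.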