An explainable deep learning framework for video violence detection using unsupervised keyframe selection and attention-based CNN.

Journal: Scientific Reports

Published Date: Feb 26, 2026

Abstract

The exponential growth of video data from surveillance and online platforms has heightened the demand for intelligent, explainable systems capable of detecting violence in real time. This study proposes a novel Explainable Attention-Enhanced Convolutional Neural Network (CNN) framework that integrates unsupervised keyframe selection, attention-driven feature learning, and Grad-CAM++-based interpretability to address redundancy, transparency, and generalization challenges in video violence detection. The proposed model automatically extracts representative keyframes using similarity-based clustering, reducing computational overhead while retaining essential temporal information. Attention modules are embedded within the CNN backbone to enhance spatial-temporal feature discrimination, while Grad-CAM++ provides interpretable visual insights into the model's decision process. Comprehensive experiments on five benchmark datasets (RLVS, Hockey Fight, Violent Flow, ShanghaiTech, and UCF-Crime) demonstrate that the framework achieves superior performance, with an average accuracy of 94.6% and F1-score of 93.9%, outperforming state-of-the-art models such as C3D, I3D, ResNet-LSTM, and ViViT. The model also delivers near-real-time efficiency (≈ 62 FPS) with reduced memory utilization (6.8 GB), confirming its suitability for deployment in surveillance and public safety systems. Statistical analysis using ANOVA and Tukey's HSD tests verified that keyframe selection and attention modules significantly improve performance (p < 0.05) with large effect sizes (η² = 0.76). The integration of interpretability further enhances reliability by localizing violence-relevant regions in frames. Overall, the proposed explainable framework establishes a robust, efficient, and transparent solution for automated violence detection in diverse real-world scenarios.
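
The abstract does not say which features or which clustering rule the keyframe selection uses, so the following is only a minimal sketch of one way similarity-based keyframe selection could work. It assumes intensity histograms as frame descriptors, cosine similarity as the similarity measure, and a greedy grouping of consecutive frames. The function names `frame_histogram` and `select_keyframes` and the parameters `sim_threshold` and `bins` are hypothetical and are not taken from the paper.

```python
import numpy as np


def frame_histogram(frame: np.ndarray, bins: int = 16) -> np.ndarray:
    """L2-normalised intensity histogram of one frame (pixel values in 0..255)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    hist = hist.astype(np.float64)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist


def select_keyframes(frames, sim_threshold: float = 0.9, bins: int = 16):
    """Assumed scheme, not the paper's: greedy clustering of consecutive frames.

    A frame joins the current cluster while its cosine similarity to that
    cluster's first frame stays at or above sim_threshold; otherwise it
    starts a new cluster. One keyframe index is returned per cluster: the
    member whose descriptor is closest to the cluster mean.
    """
    if len(frames) == 0:
        return []
    feats = np.stack([frame_histogram(f, bins) for f in frames])

    clusters = [[0]]
    for i in range(1, len(feats)):
        anchor = feats[clusters[-1][0]]
        # Descriptors are unit length, so the dot product is the cosine similarity.
        if float(feats[i] @ anchor) >= sim_threshold:
            clusters[-1].append(i)
        else:
            clusters.append([i])

    keyframes = []
    for members in clusters:
        centroid = feats[members].mean(axis=0)
        dists = np.linalg.norm(feats[members] - centroid, axis=1)
        keyframes.append(members[int(np.argmin(dists))])
    return keyframes
```

Grouping only consecutive frames keeps the selected keyframes in temporal order, which is one simple way to reduce redundancy without discarding the sequence structure; a global clustering such as k-means would be an equally plausible reading of the abstract.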
