A comprehensive framework for multi-modal hate speech detection in social media using deep learning.
Journal:
Scientific Reports
PMID:
40234479
Abstract
As social media platforms evolve, hate speech increasingly manifests across multiple modalities, including text, images, audio, and video, challenging traditional detection systems focused on single modalities. Hence, this research proposes a novel Multi-modal Hate Speech Detection Framework (MHSDF) that combines Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to analyze complex, heterogeneous data streams. This hybrid approach leverages CNNs for spatial feature extraction, such as identifying visual cues in images and local text patterns, and Long Short-Term Memory (LSTM) networks for modeling temporal dependencies and sequential information in text and audio. For textual content, the framework utilizes state-of-the-art word embeddings, including Word2Vec and BERT, to capture semantic relationships and contextual nuances. The framework integrates CNNs to extract n-gram patterns and RNNs to model long-range dependencies across sequences of up to 100 tokens. CNNs extract key spatial features in visual tasks, while LSTMs process video sequences to capture evolving visual patterns. Image spatial features include object localization, color distributions, and text extracted via Optical Character Recognition (OCR). The fusion stage employs attention mechanisms to prioritize key interactions between modalities, enabling the detection of nuanced hate speech such as memes that blend offensive imagery with implicit text, sarcastic videos where toxicity is conveyed through tone and facial expressions, and multi-layered content that embeds discriminatory meaning across different formats. The numerical findings show that the proposed MHSDF model achieves a detection accuracy of 98.53%, a robustness ratio of 97.64%, an interpretability ratio of 97.71%, a scalability ratio of 98.67%, and a performance ratio of 99.21%, outperforming existing models. Furthermore, the model's interpretability is enhanced through attention-based explanations, which provide insight into how multi-modal hate speech is identified. The framework improves the traceability of decisions, per-modality interpretability, and overall transparency.
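To make the architecture described above concrete, the following is a minimal PyTorch sketch of the CNN + LSTM + attention-fusion idea: a text branch (embedding, 1-D convolution for n-gram patterns, LSTM for sequential dependencies), an image branch (small CNN for spatial features), and an attention layer that weights the modality vectors before classification. All module names (TextBranch, ImageBranch, AttentionFusion), layer sizes, and the two-modality setup are illustrative assumptions, not the authors' published implementation.

```python
# Illustrative sketch only; hyperparameters and structure are assumptions.
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    """Embeds tokens, extracts n-gram patterns with a 1-D CNN,
    then models sequential dependencies with an LSTM."""
    def __init__(self, vocab_size=30000, emb_dim=128, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # stand-in for Word2Vec/BERT vectors
        self.conv = nn.Conv1d(emb_dim, hid_dim, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)

    def forward(self, tokens):                        # tokens: (B, T), T <= 100
        x = self.embed(tokens).transpose(1, 2)        # (B, emb_dim, T)
        x = torch.relu(self.conv(x)).transpose(1, 2)  # (B, T, hid_dim)
        out, _ = self.lstm(x)
        return out[:, -1]                             # final hidden state: (B, hid_dim)

class ImageBranch(nn.Module):
    """Small CNN for spatial features (object/color cues); in practice a
    pretrained backbone such as ResNet would replace this."""
    def __init__(self, hid_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hid_dim),
        )

    def forward(self, img):                           # img: (B, 3, H, W)
        return self.net(img)                          # (B, hid_dim)

class AttentionFusion(nn.Module):
    """Scores each modality vector and takes a weighted sum, so the
    classifier can prioritize whichever modality carries the hate signal;
    the weights double as attention-based explanations."""
    def __init__(self, hid_dim=128, n_classes=2):
        super().__init__()
        self.score = nn.Linear(hid_dim, 1)
        self.cls = nn.Linear(hid_dim, n_classes)

    def forward(self, feats):                         # feats: (B, M, hid_dim), M modalities
        w = torch.softmax(self.score(feats), dim=1)   # (B, M, 1) attention weights
        fused = (w * feats).sum(dim=1)                # (B, hid_dim)
        return self.cls(fused), w.squeeze(-1)         # logits + per-modality weights

# Usage on dummy data (batch of 4, 100-token sequences, 64x64 images):
text, image, fusion = TextBranch(), ImageBranch(), AttentionFusion()
tokens = torch.randint(0, 30000, (4, 100))
imgs = torch.randn(4, 3, 64, 64)
feats = torch.stack([text(tokens), image(imgs)], dim=1)  # (4, 2, 128)
logits, attn = fusion(feats)
print(logits.shape, attn.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```

Returning the attention weights alongside the logits mirrors the interpretability claim in the abstract: inspecting `attn` per example shows which modality the fused decision leaned on.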