Enhanced Multimodal Hate Video Detection via Channel-wise and Modality-wise Fusion
Journal: arXiv
Published Date: May 17, 2025
Abstract
The rapid rise of video content on platforms such as TikTok and YouTube has
transformed information dissemination, but it has also facilitated the spread
of harmful content, particularly hate videos. Despite significant efforts to
combat hate speech, detecting these videos remains challenging due to their
often implicit nature. Current detection methods primarily rely on unimodal
approaches, which inadequately capture the complementary features across
different modalities. While multimodal techniques offer a broader perspective,
many fail to effectively integrate temporal dynamics and modality-wise
interactions essential for identifying nuanced hate content. In this paper, we
present CMFusion, an enhanced multimodal hate video detection model utilizing a
novel Channel-wise and Modality-wise Fusion Mechanism. CMFusion first extracts
features from text, audio, and video modalities using pre-trained models and
then incorporates a temporal cross-attention mechanism to capture dependencies
between video and audio streams. The learned features are then processed by
channel-wise and modality-wise fusion modules to obtain informative
representations of videos. Our extensive experiments on a real-world dataset
demonstrate that CMFusion significantly outperforms five widely used baselines
in terms of accuracy, precision, recall, and F1 score. Comprehensive ablation
studies and parameter analyses further validate our design choices,
highlighting the model's effectiveness in detecting hate videos. The source
code will be made publicly available at https://github.com/EvelynZ10/cmfusion.
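
To illustrate the cross-attention step mentioned in the abstract, the following is a minimal sketch of standard scaled dot-product cross-attention in which video-frame features attend to audio-segment features. It is not the CMFusion implementation: the abstract does not specify the exact formulation, so the function names (`softmax`, `cross_attention`), the toy 4-dimensional features, and the omission of learned projection matrices are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (hypothetical helper).

    queries: T_q x d list of lists (e.g. video-frame features)
    keys:    T_k x d list of lists (e.g. audio-segment features)
    values:  T_k x d_v list of lists
    Returns a T_q x d_v list of lists.
    """
    scale = math.sqrt(len(queries[0]))
    value_dim = len(values[0])
    output = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(value_dim)
        ]
        output.append(attended)
    return output


# Assumed toy inputs: 3 video frames and 2 audio segments, 4 dimensions each.
# Real features would come from pre-trained encoders.
video = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]
audio = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]

# Video frames (queries) attend to audio segments (keys and values).
video_attending_audio = cross_attention(video, audio, audio)
```

Each output row is a convex combination of the audio features, weighted by how similar each audio segment is to the given video frame. Swapping the arguments gives the audio-to-video direction.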