Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments
Journal:
arXiv
Published Date:
May 25, 2025
Abstract
As online platforms grow, comment sections increasingly host harassment that
undermines user experience and well-being. This study benchmarks three leading
large language models, OpenAI GPT-4.1, Google Gemini 1.5 Pro, and Anthropic
Claude 3 Opus, on a corpus of 5,080 YouTube comments sampled from high-abuse
threads in gaming, lifestyle, food vlog, and music channels. The dataset
comprises 1,334 harmful and 3,746 non-harmful messages in English, Arabic, and
Indonesian, annotated independently by two reviewers with substantial agreement
(Cohen's kappa = 0.83). Using a unified prompt and deterministic settings,
GPT-4.1 achieved the best overall balance with an F1 score of 0.863, precision
of 0.887, and recall of 0.841. Gemini flagged the highest share of harmful
posts (recall = 0.875) but its precision fell to 0.767 due to frequent false
positives. Claude delivered the highest precision at 0.920 and the lowest
false-positive rate of 0.022, yet its recall dropped to 0.720. Qualitative
analysis showed that all three models struggle with sarcasm, coded insults, and
mixed-language slang. These results underscore the need for moderation
pipelines that combine complementary models, incorporate conversational
context, and fine-tune for under-represented languages and implicit abuse. A
de-identified version of the dataset and full prompts is publicly released to
promote reproducibility and further progress in automated content moderation.