CoT Defender: Preemptive chain-of-thought occupation for jailbreak attack mitigation.
Journal:
Neural Networks: The Official Journal of the International Neural Network Society
Published Date:
Jan 14, 2026
Abstract
As large language models (LLMs) have developed, numerous studies have demonstrated their vulnerability to carefully crafted jailbreak attacks. However, existing mitigation measures rarely achieve strong protection without sacrificing model usability, which raises concerns about model abuse. To address this, we introduce CoT Defender. It preemptively occupies the model's first few generated tokens with a chain-of-thought analysis, which hinders attackers from steering the output towards harmful content. We designed a two-stage training framework that strengthens security while preserving usability. Stage 1 fine-tunes the model to follow a structured chain-of-thought format before answering. Stage 2 employs reinforcement learning to refine this reasoning: an auxiliary attacker model continuously synthesizes new jailbreak prompts, and a lightweight evaluator, Probabilistic Structured Output Evaluation (PSOE), supplies fine-grained rewards by scoring both sentence-level intent capture and token-level format fidelity. We conducted a series of experiments on four models and six attack methods. Across all models, we reduced the average attack success rate to below 8.0%, with no more than a 7.0% impact on the response rate for benign requests. Code is available here. Warning: This paper contains red-team data that may be offensive!
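The two reward signals mentioned in the abstract (sentence-level intent capture and token-level format fidelity) can be illustrated with a toy sketch. This is not the paper's PSOE evaluator, whose details the abstract does not give. The tag names, the keyword-overlap heuristic, the equal weights, and every function name below are assumptions made purely for illustration.

```python
import re

# Hypothetical section tags; the abstract does not specify the real format.
REQUIRED_TAGS = ("<analysis>", "</analysis>", "<answer>", "</answer>")


def format_fidelity(output: str) -> float:
    """Fraction of required tags found in the expected order.

    Toy stand-in for a token-level format score.
    """
    pos = 0
    found = 0
    for tag in REQUIRED_TAGS:
        idx = output.find(tag, pos)
        if idx == -1:
            break
        found += 1
        pos = idx + len(tag)
    return found / len(REQUIRED_TAGS)


def extract_analysis(output: str) -> str:
    """Return the text between the analysis tags, or '' if they are absent."""
    match = re.search(r"<analysis>(.*?)</analysis>", output, flags=re.DOTALL)
    return match.group(1) if match else ""


def intent_capture(analysis: str, intent_keywords: set[str]) -> float:
    """Fraction of reference intent keywords mentioned in the analysis.

    Toy stand-in for a sentence-level intent score.
    """
    if not intent_keywords:
        return 0.0
    words = set(re.findall(r"[a-z]+", analysis.lower()))
    return len(words & intent_keywords) / len(intent_keywords)


def combined_reward(output: str, intent_keywords: set[str],
                    w_intent: float = 0.5, w_format: float = 0.5) -> float:
    """Weighted sum of the two toy scores (weights are assumptions)."""
    intent = intent_capture(extract_analysis(output), intent_keywords)
    return w_intent * intent + w_format * format_fidelity(output)


good = ("<analysis>the user wants malware instructions</analysis>"
        "<answer>I cannot help with that.</answer>")
hijacked = "Sure, here is how to do it"
print(combined_reward(good, {"malware", "instructions"}))
print(combined_reward(hijacked, {"malware", "instructions"}))
```

In this sketch, a response that opens with a well-formed analysis naming the request's intent scores higher than one whose first tokens were steered into a compliant opener, which is the behavior the reinforcement stage is described as rewarding.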