Unveiling the Potential of Diffusion Large Language Model in Controllable Generation
Journal: arXiv
Published Date: Jul 6, 2025
Abstract
Diffusion models, originally developed for image generation, have emerged as
a promising alternative to autoregressive large language models (LLMs). We
present a theoretical analysis comparing autoregressive and masked diffusion
LLMs, revealing that the intrinsic bidirectional attention mechanism of
diffusion LLMs (dLLMs) enables superior context modeling and generation
controllability. However, existing dLLM applications face significant
challenges in controllable generation: the native multi-step denoising process
is highly sensitive to sequence length, prone to elevated hallucination rates,
and prohibitively expensive at inference without specialized optimizations. To address these
limitations, we propose Self-adaptive Schema Scaffolding ($S^3$), a novel
framework that enables dLLMs to generate
structured outputs (e.g., JSON) while maintaining semantic fidelity and
accelerating inference. Our approach injects the target schema structure into
the output context, reducing unnecessary computation while improving
controllability. Extensive experiments demonstrate that $S^3$ achieves
substantial improvements over the baseline: a 65% increase in structural
adherence, a 48% improvement in content fidelity, and a 17% reduction in
the hallucination rate. These results establish both theoretical foundations and
practical pathways for deploying diffusion models in controllable text
generation tasks. Code and data will be publicly released.
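
The idea of injecting the target schema structure into the output context can be illustrated with a minimal sketch. This is not the authors' implementation: the `[MASK]` token, the helper names `build_scaffold` and `fill_scaffold`, the fixed number of slots per value, and the use of an empty string for unused slots are all assumptions made for illustration. The sketch fixes the JSON punctuation and keys in advance and marks only the value positions as editable, so a denoiser would need to predict only those positions.

```python
import json

MASK = "[MASK]"  # hypothetical mask token; a real dLLM defines its own


def build_scaffold(fields, slots_per_value=4):
    """Build a flat JSON object scaffold.

    Returns (tokens, editable): the token list with structure fixed and
    values masked, and a parallel list of bools that is True only at the
    masked value positions.
    """
    tokens, editable = [], []

    def emit(tok, is_slot=False):
        tokens.append(tok)
        editable.append(is_slot)

    emit("{")
    for i, name in enumerate(fields):
        if i > 0:
            emit(",")
        emit(json.dumps(name))  # quoted key, fixed in advance
        emit(":")
        emit('"')               # opening quote of the string value
        for _ in range(slots_per_value):
            emit(MASK, is_slot=True)
        emit('"')               # closing quote of the string value
    emit("}")
    return tokens, editable


def fill_scaffold(tokens, editable, predictions):
    """Stand-in for denoising: write predictions into the masked slots in order."""
    it = iter(predictions)
    return [next(it) if is_slot else tok for tok, is_slot in zip(tokens, editable)]


tokens, editable = build_scaffold(["name", "city"], slots_per_value=2)
# Assumed convention: an empty string marks a slot the value did not need.
filled = fill_scaffold(tokens, editable, ["Ada", "", "Paris", ""])
result = json.loads("".join(filled))
```

Because the braces, keys, and quotes are never masked, the joined output parses as JSON as long as the predicted value tokens contain no unescaped quotes or control characters. In this sketch only 4 of the 15 positions are editable, which is the sense in which a scaffold could reduce the work left to the denoising steps.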