Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models
Journal: arXiv
Published Date: Mar 26, 2025
Abstract
Recent advancements in autoregressive and diffusion models have led to strong
performance in generating images that contain short scene-text words. However,
generating coherent, long-form text in images, such as paragraphs in slides or
documents, remains a major challenge for current generative models. We present
the first work specifically focused on long-text image generation, addressing a
critical gap in existing text-to-image systems that typically handle only brief
phrases or single sentences. Through comprehensive analysis of state-of-the-art
autoregressive generation models, we identify the image tokenizer as a critical
bottleneck in text generation quality. To address this, we introduce a novel
text-focused, binary tokenizer optimized for capturing detailed scene text
features. Leveraging our tokenizer, we develop \ModelName, a multimodal
autoregressive model that excels in generating high-quality long-text images
with unprecedented fidelity. Our model offers robust controllability, enabling
customization of text properties such as font style, size, color, and
alignment. Extensive experiments demonstrate that \ModelName~significantly
outperforms SD3.5 Large~\cite{sd3} and GPT-4o~\cite{gpt4o} with DALL-E
3~\cite{dalle3} in generating long text accurately, consistently, and flexibly.
Beyond its technical achievements, \ModelName~opens up exciting opportunities
for innovative applications like interleaved document and PowerPoint
generation, establishing a new frontier in long-text image generation.