Hita: Holistic Tokenizer for Autoregressive Image Generation
Journal: arXiv
Published Date: Jul 3, 2025
Abstract
Vanilla autoregressive image generation models generate visual tokens
step-by-step, limiting their ability to capture holistic relationships among
token sequences. Moreover, because most visual tokenizers map local image
patches into latent tokens, global information is limited. To address this, we
introduce Hita, a novel image tokenizer for autoregressive (AR) image
generation. It uses a holistic-to-local tokenization scheme with
learnable holistic queries and local patch tokens. Hita incorporates two key
strategies to better align with the AR generation process: 1) arranging a
sequential structure with holistic tokens at the beginning, followed by
patch-level tokens, and using causal attention to maintain awareness of
previous tokens; and 2) adopting a lightweight fusion module before feeding the
de-quantized tokens into the decoder to control information flow and prioritize
holistic tokens. Extensive experiments show that Hita speeds up the training
of AR generators and outperforms those trained with vanilla tokenizers,
achieving 2.59 FID and 281.9 IS on the ImageNet benchmark.
Detailed analysis of the holistic representation highlights its ability to
capture global image properties, such as textures, materials, and shapes.
Additionally, Hita demonstrates effectiveness in zero-shot style transfer
and image in-painting. The code is available at
https://github.com/CVMI-Lab/Hita.
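
The first strategy in the abstract (holistic tokens placed first, then patch-level tokens, with causal attention) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `arrange_tokens` and `causal_mask` and the token labels are hypothetical, chosen only to show the visibility pattern that a standard lower-triangular causal mask gives under this ordering. The fusion module is not sketched here because the abstract does not describe its internals.

```python
# Illustrative sketch only; names and token labels are hypothetical.

def arrange_tokens(holistic, patches):
    """Return one sequence with holistic tokens first, then patch tokens."""
    return list(holistic) + list(patches)


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


holistic = ["h0", "h1"]        # stand-ins for holistic tokens
patches = ["p0", "p1", "p2"]   # stand-ins for local patch tokens

seq = arrange_tokens(holistic, patches)
mask = causal_mask(len(seq))
num_h = len(holistic)

# Every patch position can attend to every holistic position.
patch_sees_all_holistic = all(
    mask[i][j] for i in range(num_h, len(seq)) for j in range(num_h)
)

# No holistic position can attend to any patch position.
holistic_sees_no_patch = not any(
    mask[i][j] for i in range(num_h) for j in range(num_h, len(seq))
)
```

Under this ordering, each patch-level token is conditioned on all holistic tokens, while the holistic tokens depend only on one another. That is the sense in which the sequence runs from holistic to local.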