Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

Journal: arXiv

Published Date: Jul 11, 2025

Abstract

Leveraging the powerful representations of pre-trained vision foundation models -- traditionally used for visual comprehension -- we explore a novel direction: building an image tokenizer directly atop such models, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation -- achieving a gFID of 2.07 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code will be released publicly to benefit the community.
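
To make the second component more concrete, here is a minimal sketch of what a semantic reconstruction objective could look like. The abstract does not state the form of the loss, so the use of cosine distance, the function names `cosine_similarity` and `semantic_reconstruction_loss`, and the toy feature values are all assumptions for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_reconstruction_loss(predicted, target):
    """Mean (1 - cosine similarity) over token positions.

    ASSUMPTION: cosine distance stands in for the alignment objective;
    the abstract does not specify which distance is used.
    `predicted` plays the role of the tokenizer's reconstructed features,
    `target` the features of the frozen foundation model.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same number of tokens")
    distances = [1.0 - cosine_similarity(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)


# Hypothetical toy features: two token positions, two channels each.
frozen_features = [[1.0, 0.0], [0.0, 2.0]]
aligned = [[2.0, 0.0], [0.0, 1.0]]      # same directions, different scale
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # orthogonal directions

loss_aligned = semantic_reconstruction_loss(aligned, frozen_features)
loss_misaligned = semantic_reconstruction_loss(misaligned, frozen_features)
```

Under this assumed loss, features that point in the same direction as the frozen model's features incur near-zero penalty regardless of scale, while orthogonal features are penalized, which is one simple way to express "preserving semantic fidelity".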

Authors

Anlin Zheng
Xin Wen
Xuanyang Zhang
Chuofan Ma
Tiancai Wang
Gang Yu
Xiangyu Zhang
Xiaojuan Qi

External Resources

View on arXiv (http://arxiv.org/abs/2507.08441v1)
