LlamaSeg: Image Segmentation via Autoregressive Mask Generation

Journal: arXiv

Published Date: May 26, 2025

Abstract

We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs. By adhering to the next-token prediction paradigm, our approach naturally integrates segmentation tasks into autoregressive architectures. To support large-scale training, we introduce a data annotation pipeline and construct the SA-OVRS dataset, which contains 2M segmentation masks annotated with over 5,800 open-vocabulary labels or diverse textual descriptions, covering a wide spectrum of real-world scenarios. This enables our model to localize objects in images based on text prompts and to generate fine-grained masks. To more accurately evaluate the quality of masks produced by visual generative models, we further propose a composite metric that combines Intersection over Union (IoU) with Average Hausdorff Distance (AHD), offering a more precise assessment of contour fidelity. Experimental results demonstrate that our method surpasses existing generative models across multiple datasets and yields more detailed segmentation masks.
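The two ingredients of the evaluation metric can be illustrated with a minimal sketch. The abstract does not say how IoU and AHD are combined, so the code below only computes each quantity separately for a pair of binary masks. The function names (`iou`, `boundary_points`, `average_hausdorff`), the 4-neighbour boundary extraction, and the symmetric-mean form of AHD are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def iou(pred, gt):
    """Intersection over Union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)


def boundary_points(mask):
    """Coordinates of foreground pixels with at least one background 4-neighbour."""
    mask = mask.astype(bool)
    padded = np.pad(mask, 1, constant_values=False)
    # A pixel is interior if all four of its direct neighbours are foreground.
    interior = (
        padded[:-2, 1:-1] & padded[2:, 1:-1] & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return np.argwhere(mask & ~interior)


def average_hausdorff(pred, gt):
    """Average Hausdorff Distance between mask contours (one common definition).

    Mean nearest-neighbour distance from pred to gt and from gt to pred,
    averaged over the two directions. Other variants exist.
    """
    a = boundary_points(pred).astype(float)
    b = boundary_points(gt).astype(float)
    if len(a) == 0 and len(b) == 0:
        return 0.0
    if len(a) == 0 or len(b) == 0:
        return float("inf")  # assumption: distance to an empty contour is infinite
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean()))
```

IoU rewards region overlap, while AHD penalizes contour points that lie far from the reference contour, which is why the two together are more sensitive to boundary quality than IoU alone. The pairwise distance matrix is quadratic in the number of contour pixels, so this sketch suits small masks only.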

Authors

Jiru Deng
Tengjin Weng
Tianyu Yang
Wenhan Luo
Zhiheng Li
Wenhao Jiang

External Resources

View on arXiv: http://arxiv.org/abs/2505.19422v1
