LlamaSeg: Image Segmentation via Autoregressive Mask Generation
Journal: arXiv
Published Date: May 26, 2025
Abstract
We present LlamaSeg, a visual autoregressive framework that unifies multiple
image segmentation tasks via natural language instructions. We reformulate
image segmentation as a visual generation problem, representing masks as
"visual" tokens and employing a LLaMA-style Transformer to predict them
directly from image inputs. By adhering to the next-token prediction paradigm,
our approach naturally integrates segmentation tasks into autoregressive
architectures. To support large-scale training, we introduce a data annotation
pipeline and construct the SA-OVRS dataset, which contains 2M segmentation
masks annotated with over 5,800 open-vocabulary labels or diverse textual
descriptions, covering a wide spectrum of real-world scenarios. This enables
our model to localize objects in images based on text prompts and to generate
fine-grained masks. To better evaluate the masks produced
by visual generative models, we further propose a composite metric that
combines Intersection over Union (IoU) with Average Hausdorff Distance (AHD),
so that contour fidelity is assessed alongside region overlap. Experimental results
demonstrate that our method surpasses existing generative models across
multiple datasets and yields more detailed segmentation masks.
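
As a rough illustration of the two ingredients of the composite metric, the sketch below computes IoU and an Average Hausdorff Distance between the boundaries of two binary masks. The abstract does not state how the two quantities are combined, so the code reports them separately. The function names, the 4-neighbour boundary definition, and the symmetric averaging convention for AHD are assumptions made for this example, not the paper's implementation.

```python
import numpy as np


def iou(pred, gt):
    """Intersection over Union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumption)
    return float(np.logical_and(pred, gt).sum() / union)


def boundary_points(mask):
    """(row, col) coordinates of mask pixels with a 4-neighbour outside the mask."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, constant_values=False)
    # A pixel is interior when its up, down, left, and right neighbours are all set.
    interior = (
        padded[:-2, 1:-1] & padded[2:, 1:-1]
        & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return np.argwhere(mask & ~interior).astype(float)


def average_hausdorff(pred, gt):
    """Mean of the two directed average nearest-boundary distances, in pixels."""
    a = boundary_points(pred)
    b = boundary_points(gt)
    if len(a) == 0 or len(b) == 0:
        # Assumed convention: 0 if both are empty, infinite if only one is.
        return 0.0 if len(a) == len(b) else float("inf")
    # Pairwise Euclidean distances between the two boundary point sets.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean()))


if __name__ == "__main__":
    gt = np.zeros((8, 8), dtype=bool)
    gt[2:6, 2:6] = True          # 4x4 square
    pred = np.zeros((8, 8), dtype=bool)
    pred[2:6, 3:7] = True        # same square shifted right by one pixel
    print("IoU:", iou(pred, gt))
    print("AHD:", average_hausdorff(pred, gt))
```

IoU measures region overlap only, while the boundary distance responds to how far the predicted contour drifts from the reference contour, which is why the two are complementary. The pairwise distance matrix is quadratic in the number of boundary pixels, so this sketch is suited to small masks; a distance transform would be the usual choice at scale.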