PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding
Journal: arXiv
Published Date: May 27, 2025
Abstract
Real-world objects are composed of distinctive, object-specific parts.
Identifying these parts is key to performing fine-grained, compositional
reasoning, yet large multimodal models (LMMs) struggle to perform this
seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM
benchmark designed for pixel-level part grounding. We construct PARTONOMY from
existing part datasets and our own rigorously annotated set of images,
encompassing 862 part labels and 534 object labels for evaluation. Unlike
existing datasets that simply ask models to identify generic parts, PARTONOMY
uses specialized concepts (e.g., agricultural airplane), and challenges models
to compare objects' parts, consider part-whole relationships, and justify
textual predictions with visual segmentations. Our experiments demonstrate
significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves only
5.9% gIoU), highlighting a critical gap in their part grounding abilities. We
note that existing segmentation-enabled LMMs (segmenting LMMs) have two key
architectural shortcomings: they use special [SEG] tokens not seen during
pretraining, which induce distribution shift, and they discard predicted
segmentations instead of using past predictions to guide future ones. To
address these deficiencies, we train several part-centric LMMs and propose
PLUM, a novel segmenting LMM that uses span tagging instead of segmentation
tokens and that conditions on prior predictions in a feedback loop. We find
that pretrained PLUM outperforms existing segmenting LMMs on reasoning
segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM
finetuned on our proposed Explanatory Part Segmentation task is competitive
with segmenting LMMs trained on significantly more segmentation data. Our work
opens up new avenues towards enabling fine-grained, grounded visual
understanding in LMMs.
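
The abstract reports results in gIoU without defining it. The sketch below assumes the convention common in the reasoning segmentation literature, where gIoU is the mean of the per-example intersection-over-union between predicted and ground-truth masks. The function names (mask_iou, giou) and the nested-list mask format are hypothetical choices for illustration, not the benchmark's actual evaluation code.

```python
from typing import Sequence

# Assumption: a mask is a 2D grid of 0/1 values, where 1 marks a part pixel.
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union between two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def giou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Mean of per-example IoUs (the assumed definition of gIoU)."""
    if len(preds) != len(targets) or len(preds) == 0:
        raise ValueError("preds and targets must be non-empty and equal in length")
    ious = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(ious) / len(ious)


# Two toy examples: the first has IoU 1/2, the second is an exact match.
preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
score = giou(preds, targets)  # (0.5 + 1.0) / 2 = 0.75
```

Because every example contributes equally under this definition, small parts weigh as much as large ones, which is one reason a per-example average can be a demanding metric for part-level grounding.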