E-SAM: Training-Free Segment Every Entity Model
Journal: arXiv
Published Date: Mar 15, 2025
Abstract
Entity Segmentation (ES) aims at identifying and segmenting distinct entities
within an image without the need for predefined class labels. This
characteristic makes ES well suited to open-world applications that must adapt
to diverse and dynamically changing environments, where new and previously
unseen entities may appear frequently. Existing ES methods require either large
annotated datasets or costly training, which limits their scalability and
adaptability. Recently, the Segment Anything Model (SAM), especially in its
Automatic Mask Generation (AMG) mode, has shown potential for holistic image
segmentation. However, it struggles with over-segmentation and
under-segmentation, making it less effective for ES. In this paper, we
introduce E-SAM, a novel training-free framework that exhibits exceptional ES
capability. Specifically, we first propose Multi-level Mask Generation (MMG)
that hierarchically processes SAM's AMG outputs to generate reliable
object-level masks while preserving fine details at other levels. Entity-level
Mask Refinement (EMR) then refines these object-level masks into accurate
entity-level masks. That is, it separates overlapping masks to address the
redundancy issues inherent in SAM's outputs and merges similar masks by
evaluating entity-level consistency. Lastly, Under-Segmentation Refinement
(USR) addresses under-segmentation by generating additional high-confidence
masks fused with EMR outputs to produce the final ES map. These three modules
are combined to achieve the best ES performance without additional training
overhead. Extensive experiments show that E-SAM achieves state-of-the-art
performance compared to prior ES methods, with a significant improvement of
+30.1 on benchmark metrics.
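
The two operations attributed to EMR above, separating overlapping masks and merging similar ones, can be illustrated with a minimal Python sketch. This is not the E-SAM implementation. The abstract does not specify the criteria, so the sketch assumes that contested pixels go to the higher-scoring mask and that "similar" means an IoU above a threshold, used here as a simple stand-in for entity-level consistency. The function names, the 0.8 threshold, and the set-of-pixels mask representation are all hypothetical. The sketch merges near-duplicates before separating overlaps only because IoU is uninformative once masks are disjoint; this ordering is also an assumption.

```python
def iou(a, b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def merge_similar(masks, iou_thresh=0.8):
    """Greedily merge masks whose IoU with an already kept mask is high.

    Assumption: IoU stands in for the paper's entity-level consistency check.
    """
    merged = []
    for m in masks:
        for i, kept in enumerate(merged):
            if iou(m, kept) >= iou_thresh:
                merged[i] = kept | m  # absorb the near-duplicate
                break
        else:
            merged.append(set(m))
    return merged


def resolve_overlaps(masks, scores):
    """Make masks pairwise disjoint.

    Assumption: a contested pixel is kept by the mask with the highest score.
    """
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    claimed = set()
    result = [set() for _ in masks]
    for i in order:
        result[i] = set(masks[i]) - claimed
        claimed |= result[i]
    return result


# Toy 4-column image: a covers rows 0-1, b covers rows 1-2, a_dup is a minus one pixel.
a = {(r, c) for r in range(0, 2) for c in range(4)}
b = {(r, c) for r in range(1, 3) for c in range(4)}
a_dup = a - {(0, 0)}

merged = merge_similar([a, a_dup, b])            # a and a_dup collapse into one mask
disjoint = resolve_overlaps(merged, [0.9, 0.5])  # row 1 stays with the higher-scoring mask
```

In this toy case the near-duplicate pair collapses into a single mask, and the shared row is assigned to the mask with the higher score, so the final masks no longer overlap.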