MoRe: Class Patch Attention Needs Regularization for Weakly Supervised Semantic Segmentation
Journal: arXiv
Published Date: Dec 15, 2024
Abstract
Weakly Supervised Semantic Segmentation (WSSS) with image-level labels
typically uses Class Activation Maps (CAM) to achieve dense predictions.
Recently, Vision Transformer (ViT) has provided an alternative way to generate
localization maps from class-patch attention. However, because the modeling of
this attention is insufficiently constrained, we observe that the Localization
Attention Maps (LAM) often suffer from artifacts, i.e., patch
regions with minimal semantic relevance are falsely activated by class tokens.
In this work, we propose MoRe to address this issue and further explore the
potential of LAM. Our findings suggest that imposing additional regularization
on class-patch attention is necessary. To this end, we first view the attention
as a novel directed graph and propose the Graph Category Representation module
to implicitly regularize the interaction among class-patch entities. It ensures
that class tokens dynamically condense the related patch information and
suppress unrelated artifacts at a graph level. Second, motivated by the
observation that CAM from classification weights maintains smooth localization
of objects, we devise the Localization-informed Regularization module to
explicitly regularize the class-patch attention. It directly mines the token
relations from CAM and further supervises the consistency between class and
patch tokens in a learnable manner. Extensive experiments are conducted on
PASCAL VOC and MS COCO, validating that MoRe effectively addresses the artifact
issue and achieves state-of-the-art performance, surpassing recent single-stage
and even multi-stage methods. Code is available at
https://github.com/zwyang6/MoRe.
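
To make the notion of a localization map read from class-patch attention concrete, here is a minimal sketch. It is not the MoRe implementation. It assumes a ViT with one class token per category placed before the patch tokens, and a single square attention matrix (for example, averaged over heads). The function name `class_patch_maps` and its parameters are hypothetical, chosen only for illustration.

```python
def class_patch_maps(attn, num_classes, grid_h, grid_w):
    """Read one raw localization map per class from an attention matrix.

    Assumptions (illustrative, not taken from the paper's code):
    - `attn` is a square list of lists over the token sequence
      [class tokens..., patch tokens...], where attn[i][j] is how much
      token i attends to token j.
    - Patch tokens are in row-major order over a grid_h x grid_w grid.
    """
    num_patches = grid_h * grid_w
    expected = num_classes + num_patches
    if len(attn) != expected or any(len(row) != expected for row in attn):
        raise ValueError("attention matrix does not match the token layout")

    maps = []
    for c in range(num_classes):
        # Class-patch attention: class token c attending to every patch token.
        row = attn[c][num_classes:]
        lo, hi = min(row), max(row)
        span = hi - lo
        # Min-max normalize to [0, 1]; a flat row becomes an all-zero map.
        norm = [(v - lo) / span if span > 0 else 0.0 for v in row]
        # Fold the flat patch sequence back into the 2-D patch grid.
        maps.append([norm[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)])
    return maps


if __name__ == "__main__":
    # Toy example: 2 class tokens and a 2x2 patch grid (6 tokens in total).
    uniform = [1.0 / 6.0] * 6
    attn = [
        [0.125, 0.125, 0.5, 0.125, 0.0625, 0.0625],  # class 0 row
        list(uniform),                                # class 1 row (flat)
        list(uniform), list(uniform), list(uniform), list(uniform),
    ]
    for class_index, class_map in enumerate(class_patch_maps(attn, 2, 2, 2)):
        print(class_index, class_map)
```

A map obtained this way is unregularized: any patch that a class token happens to attend to appears as activated, whether or not it is semantically related, which is the artifact behavior the abstract describes.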