Improving Contrastive Learning for Referring Expression Counting
Journal: arXiv
Published Date: May 28, 2025
Abstract
Object counting has progressed from class-specific models, which count only
known categories, to class-agnostic models that generalize to unseen
categories. The next challenge is Referring Expression Counting (REC), where
the goal is to count objects based on fine-grained attributes and contextual
differences. Existing methods struggle to distinguish visually similar
objects that belong to the same category but correspond to different referring
expressions. To address this, we propose C-REX, a novel contrastive learning
framework, based on supervised contrastive learning, designed to enhance
discriminative representation learning. Unlike prior works, C-REX operates
entirely within the image space, avoiding the misalignment issues of image-text
contrastive learning, thus providing a more stable contrastive signal. It also
guarantees a significantly larger pool of negative samples, leading to improved
robustness in the learned representations. Moreover, we show that our
framework is versatile and generic enough to be applied to other similar tasks,
such as class-agnostic counting. To support our approach, we analyze the key
components of state-of-the-art detection-based models and identify that detecting object
centroids instead of bounding boxes is the key common factor behind their
success in counting tasks. We use this insight to design a simple yet effective
detection-based baseline to build upon. Our experiments show that C-REX
achieves state-of-the-art results in REC, outperforming previous methods by
more than 22% in MAE and more than 10% in RMSE, while also demonstrating
strong performance in class-agnostic counting. Code is available at
https://github.com/cvlab-stonybrook/c-rex.
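
To make the idea of supervised contrastive learning within the image space more concrete, the following is a minimal NumPy sketch of the generic supervised contrastive loss that the abstract names as the basis of C-REX. It is not the C-REX formulation itself. The function name `supcon_loss`, the temperature value, and the labeling scheme (one integer label per object embedding, for example indicating which referring expression the object matches) are assumptions made for illustration only.

```python
import numpy as np

def supcon_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss over a batch of embeddings.

    features: (n, d) array of image-space embeddings (hypothetical input).
    labels:   (n,) integer labels; same label means "positive pair".
    Illustrative sketch only, not the C-REX implementation.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = features.shape[0]
    if n < 2:
        return 0.0  # no pairs to contrast

    # L2-normalize so dot products are cosine similarities.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    z = features / np.maximum(norms, 1e-12)
    sim = (z @ z.T) / temperature

    # Exclude each anchor from its own denominator.
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)

    # Numerically stable log-softmax over all other samples.
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - row_max - log_denom

    # Positives share the anchor's label; every other sample is a negative.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0  # anchors with at least one positive
    if not valid.any():
        return 0.0

    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[valid] / pos_count[valid]
    return float(per_anchor.mean())

# Toy usage: four 2-D embeddings forming two tight groups.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
aligned = supcon_loss(feats, [0, 0, 1, 1])     # labels agree with the groups
misaligned = supcon_loss(feats, [0, 1, 0, 1])  # labels cut across the groups
print(aligned < misaligned)
```

In this sketch every sample in the batch that does not share the anchor's label acts as a negative, so the number of negatives grows with the number of embeddings drawn from the image, which is one plausible reading of the larger negative pool mentioned above.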