Empowering Vision Transformers with Multi-Scale Causal Intervention for Long-Tailed Image Classification
Journal:
arXiv
Published Date:
May 13, 2025
Abstract
Causal inference has emerged as a promising approach to mitigate long-tail
classification by handling the biases introduced by class imbalance. However,
along with the change of advanced backbone models from Convolutional Neural
Networks (CNNs) to Visual Transformers (ViT), existing causal models may not
achieve an expected performance gain. This paper investigates the influence of
existing causal models on CNNs and ViT variants, highlighting that ViT's global
feature representation makes it hard for causal methods to model associations
between fine-grained features and predictions, which leads to difficulties in
classifying tail classes with similar visual appearance. To address these
issues, this paper proposes TSCNet, a two-stage causal modeling method to
discover fine-grained causal associations through multi-scale causal
interventions. Specifically, in the hierarchical causal representation learning
stage (HCRL), it decouples the background and objects, applying backdoor
interventions at both the patch and feature level to prevent model from using
class-irrelevant areas to infer labels which enhances fine-grained causal
representation. In the counterfactual logits bias calibration stage (CLBC), it
refines the optimization of model's decision boundary by adaptive constructing
counterfactual balanced data distribution to remove the spurious associations
in the logits caused by data distribution. Extensive experiments conducted on
various long-tail benchmarks demonstrate that the proposed TSCNet can eliminate
multiple biases introduced by data imbalance, which outperforms existing
methods.