Hierarchical Mask-Enhanced Dual Reconstruction Network for Few-Shot Fine-Grained Image Classification
Journal: arXiv
Published Date: Jun 25, 2025
Abstract
Few-shot fine-grained image classification (FS-FGIC) presents a significant
challenge, requiring models to distinguish visually similar subclasses with
limited labeled examples. Existing methods have critical limitations:
metric-based methods lose spatial information and misalign local features,
while reconstruction-based methods fail to utilize hierarchical feature
information and lack mechanisms to focus on discriminative regions. We propose
the Hierarchical Mask-enhanced Dual Reconstruction Network (HMDRN), which
integrates dual-layer feature reconstruction with mask-enhanced feature
processing to improve fine-grained classification. HMDRN incorporates a
dual-layer feature reconstruction and fusion module that leverages
complementary visual information from different network hierarchies. Through
learnable fusion weights, the model balances high-level semantic
representations from the last layer with mid-level structural details from the
penultimate layer. Additionally, we design a spatial binary mask-enhanced
transformer self-reconstruction module that applies adaptive thresholding to
query features while keeping support features complete, sharpening focus on
discriminative regions and filtering out background noise. Extensive
experiments on three challenging fine-grained datasets demonstrate that HMDRN
consistently outperforms state-of-the-art methods across Conv-4 and ResNet-12
backbone architectures. Comprehensive ablation studies validate the
effectiveness of each proposed component, revealing that dual-layer
reconstruction enhances inter-class discrimination while mask-enhanced
transformation reduces intra-class variations. Visualization results provide
evidence of HMDRN's superior feature reconstruction capabilities.
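
The abstract names two mechanisms without giving formulas: learnable weights that fuse the last-layer and penultimate-layer reconstruction results, and an adaptive threshold that turns query activations into a spatial binary mask. The sketch below is only an illustration of those two ideas under stated assumptions, not the authors' implementation. The softmax normalization of the fusion weights, the mean-based threshold, and all function and parameter names (`fuse_distances`, `binary_spatial_mask`, `ratio`) are hypothetical choices made for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_distances(dist_last, dist_penultimate, fusion_logits):
    """Blend per-class reconstruction distances from two layers.

    Assumption: the learnable fusion weights are two logits normalized
    with a softmax; the abstract does not specify the normalization.
    """
    w_last, w_pen = softmax(fusion_logits)
    return [w_last * a + w_pen * b
            for a, b in zip(dist_last, dist_penultimate)]

def binary_spatial_mask(activations, ratio=1.0):
    """Binary mask over spatial positions of a query feature map.

    Assumption: the adaptive threshold is `ratio` times the mean
    activation of this particular query, so it changes per image.
    """
    threshold = ratio * sum(activations) / len(activations)
    return [1 if a >= threshold else 0 for a in activations]

def apply_mask(features, mask):
    """Zero out feature vectors at masked-out spatial positions."""
    return [[value * keep for value in vector]
            for vector, keep in zip(features, mask)]

# Two classes, equal fusion logits: each layer contributes half.
fused = fuse_distances([0.2, 0.9], [0.4, 0.5], [0.0, 0.0])
predicted_class = min(range(len(fused)), key=fused.__getitem__)

# Four spatial positions; only those at or above the mean survive.
mask = binary_spatial_mask([0.1, 0.8, 0.3, 0.9])
masked_query = apply_mask([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]], mask)
```

In this toy setup the query is assigned to the class with the smallest fused reconstruction distance, and only the query features are masked, which mirrors the abstract's statement that support features are kept complete.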