DuoFormer: Leveraging Hierarchical Representations by Local and Global Attention Vision Transformer
Journal: arXiv
Published Date: Jun 15, 2025
Abstract
Despite the widespread adoption of transformers in medical applications,
multi-scale learning with transformers remains underexplored, even though
hierarchical representations are considered advantageous for computer-aided
medical diagnosis. We propose a novel hierarchical transformer model that
adeptly integrates the feature extraction capabilities of Convolutional Neural
Networks (CNNs) with the advanced representational potential of Vision
Transformers (ViTs). Addressing the lack of inductive biases and dependence on
extensive training datasets in ViTs, our model employs a CNN backbone to
generate hierarchical visual representations. These representations are adapted
for transformer input through an innovative patch tokenization process,
preserving the inherited multi-scale inductive biases. We also introduce a
scale-wise attention mechanism that directly captures intra-scale and
inter-scale associations. This mechanism complements patch-wise attention by
enhancing spatial understanding and preserving global perception, which we
refer to as local and global attention, respectively. Our model significantly
outperforms baseline models in classification accuracy, demonstrating its
effectiveness in bridging the gap between CNNs and ViTs. The components are
designed as plug-and-play modules for different CNN architectures and can be
adapted for multiple applications.
The code is available at https://github.com/xiaoyatang/DuoFormer.git.
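
As a rough illustration of the pipeline described in the abstract, the following NumPy sketch tokenizes multi-scale CNN feature maps and applies attention first across scales, then across patches. It is not the released implementation: the function names (`tokenize`, `duo_attention`), the average pooling onto a shared grid, the random matrices standing in for learned projections, the single-head attention without query/key/value weights, and the sequential ordering of the two attention steps are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head self-attention over the second-to-last axis.
    # Learned query/key/value projections are omitted (assumption).
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def tokenize(feature_maps, grid, dim, rng):
    # Hypothetical tokenizer: average-pool each (C, H, W) map onto a shared
    # grid x grid layout, then project channels to a common width `dim`.
    # Assumes H and W are divisible by `grid`.
    per_scale = []
    for fmap in feature_maps:
        c, h, w = fmap.shape
        pooled = fmap.reshape(c, grid, h // grid, grid, w // grid).mean(axis=(2, 4))
        tokens = pooled.reshape(c, grid * grid).T           # (N, C)
        proj = rng.standard_normal((c, dim)) / np.sqrt(c)   # stand-in for a learned projection
        per_scale.append(tokens @ proj)                     # (N, D)
    return np.stack(per_scale, axis=1)                      # (N, S, D)

def duo_attention(tokens):
    # tokens: (N patches, S scales, D channels).
    # Scale-wise step: each patch attends over its own S scale tokens.
    scale_mixed = self_attention(tokens)
    # Patch-wise step: each scale attends over all N patches.
    patch_mixed = self_attention(np.swapaxes(scale_mixed, 0, 1))
    return np.swapaxes(patch_mixed, 0, 1)

rng = np.random.default_rng(0)
# Stand-ins for four stages of a CNN backbone (shapes are assumptions).
feature_maps = [
    rng.standard_normal((64, 56, 56)),
    rng.standard_normal((128, 28, 28)),
    rng.standard_normal((256, 14, 14)),
    rng.standard_normal((512, 7, 7)),
]
tokens = tokenize(feature_maps, grid=7, dim=96, rng=rng)
out = duo_attention(tokens)
print(tokens.shape, out.shape)  # (49, 4, 96) (49, 4, 96)
```

Arranging the tokens as (patches, scales, channels) lets both attention steps reuse the same routine by swapping which axis acts as the sequence; the actual model may combine the two steps differently.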