MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion
Journal:
arXiv
Published Date:
May 19, 2025
Abstract
The combination of Spiking Neural Networks (SNNs) with Vision Transformer
architectures has attracted significant attention due to its great potential
for energy-efficient and high-performance computing paradigms. However, a
substantial performance gap still exists between SNN-based and ANN-based
transformer architectures. While existing methods propose spiking
self-attention mechanisms that are successfully combined with SNNs, the overall
architectures proposed by these methods suffer from a bottleneck in effectively
extracting features from different image scales. In this paper, we address this
issue and propose MSVIT, a novel spike-driven Transformer architecture, which
is the first to use multi-scale spiking attention (MSSA) to enrich the capability of
spiking attention blocks. We validate our approach on several mainstream
datasets. The experimental results show that MSVIT outperforms existing SNN-based
models, positioning itself as a state-of-the-art solution among SNN-transformer
architectures. The code is available at
https://github.com/Nanhu-AI-Lab/MSViT.