MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion

Journal: arXiv

Published Date: May 19, 2025

Abstract

Combining Spiking Neural Networks (SNNs) with Vision Transformer architectures has attracted significant attention because of its potential for energy-efficient, high-performance computing. However, a substantial performance gap still exists between SNN-based and ANN-based transformer architectures. Existing methods propose spiking self-attention mechanisms that integrate successfully with SNNs, but their overall architectures remain bottlenecked in extracting features at different image scales. In this paper, we address this issue and propose MSVIT, a novel spike-driven Transformer architecture that, for the first time, uses multi-scale spiking attention (MSSA) to enrich the capability of spiking attention blocks. We validate our approach on several mainstream datasets. The experimental results show that MSVIT outperforms existing SNN-based models, positioning it as a state-of-the-art solution among SNN-transformer architectures. The code is available at https://github.com/Nanhu-AI-Lab/MSViT.

Authors

Wei Hua
Chenlin Zhou
Jibin Wu
Yansong Chua
Yangyang Shu

External Resources

arXiv (http://arxiv.org/abs/2505.14719v1)
