Frequency-Assisted Local Attention in Lower Layers of Visual Transformers.

Journal: International journal of neural systems

PMID: 40016195

Abstract

Since vision transformers excel at establishing global relationships between features, they play an important role in current vision tasks. However, the global attention mechanism restricts the capture of local features, making convolutional assistance necessary. This paper indicates that transformer-based models can attend to local information without using convolutional blocks, similar to convolutional kernels, by employing a special initialization method. Therefore, this paper proposes a novel hybrid multi-scale model called Frequency-Assisted Local Attention Transformer (FALAT). FALAT introduces a Frequency-Assisted Window-based Positional Self-Attention (FWPSA) module that limits the attention distance of query tokens, enabling the capture of local contents in the early stage. The information from value tokens in the frequency domain enhances information diversity during self-attention computation. Additionally, the traditional convolutional method is replaced with a depth-wise separable convolution to downsample in the spatial reduction attention module for long-distance contents in the later stages. Experimental results demonstrate that FALAT-S achieves 83.0% accuracy on IN-1k with an input size of [Formula: see text] using 29.9[Formula: see text]M parameters and 5.6[Formula: see text]G FLOPs. This model outperforms the Next-ViT-S by 0.9[Formula: see text]AP/0.8[Formula: see text]AP with Mask-R-CNN [Formula: see text] on COCO and surpasses the recent FastViT-SA36 by 3.1% mIoU with FPN on ADE20k.

Authors

Xin Zhou

School of Mechatronic Engineering, China University of Mining & Technology, Xuzhou 221116, China.
Zeyu Jiang

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China.
Shihua Zhou
Zhaohui Ren

School of Mechanical Engineering and Automation, Northeastern University, Wenhua Road, Shen Yang, Liao Ning, P. R. China.
Yongchao Zhang

School of Mechanical Engineering and Automation, Northeastern University, Wenhua Road, Shen Yang, Liao Ning, P. R. China.
Tianzhuang Yu

School of Mechanical Engineering and Automation, Northeastern University, Wenhua Road, Shen Yang, Liao Ning, P. R. China.
Yulin Liu

Department of Radiology, Hubei Cancer Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China.

Keywords

Attention Humans Neural Networks, Computer Visual Perception

External Resources

View on PubMed Access via DOI PubMed (40016195)

Frequency-Assisted Local Attention in Lower Layers of Visual Transformers.

Abstract

Authors

Keywords

External Resources

Popular Topics

Recent Journals