DiagSWin: A multi-scale vision transformer with diagonal-shaped windows for object detection and segmentation.

Journal: Neural Networks: the official journal of the International Neural Network Society
PMID:

Abstract

Recently, Vision Transformer and its variants have demonstrated remarkable performance on various computer vision tasks, thanks to their ability to capture global visual dependencies through self-attention. However, global self-attention is computationally expensive because its cost grows quadratically with the number of tokens, which is especially problematic for high-resolution vision tasks (e.g., object detection and semantic segmentation). Many recent works attempt to reduce this cost by applying fine-grained local attention, but these approaches cripple the long-range modeling power of the original self-attention mechanism. Furthermore, they usually have similar receptive fields within each layer, limiting the ability of each self-attention layer to capture multi-scale features and degrading performance on images containing objects of different scales. To address these issues, we develop the Diagonal-shaped Window (DiagSWin) attention mechanism for modeling attention in diagonal regions at hybrid scales per attention layer. The key idea of DiagSWin attention is to inject multi-scale receptive field sizes into tokens: before computing the self-attention matrix, each token attends to its closest surrounding tokens at fine granularity and to the tokens far away at coarse granularity. This mechanism effectively captures multi-scale contextual information while reducing computational complexity. With DiagSWin attention, we present a new variant of Vision Transformer models, called DiagSWin Transformers, and demonstrate their superiority in extensive experiments across various tasks. Specifically, the large DiagSWin Transformer achieves 84.4% Top-1 accuracy on ImageNet, outperforming the SOTA CSWin Transformer with 40% less model size and computation cost. When employed as backbones, DiagSWin Transformers achieve significant improvements over the current SOTA models. In addition, our DiagSWin-Base model yields 51.1 box mAP and 45.8 mask mAP on COCO for object detection and segmentation, and 52.3 mIoU on ADE20K for semantic segmentation.
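
The following is a minimal, hedged sketch (PyTorch) of the fine-near / coarse-far idea described in the abstract: each query token attends to its immediate neighbours at full resolution and to the rest of the sequence through pooled, coarser tokens. It only illustrates hybrid-scale attention over a 1-D token sequence under assumed names (hybrid_scale_attention, local_radius, pool_size); it is not the paper's diagonal-window 2-D implementation.

import torch
import torch.nn.functional as F

def hybrid_scale_attention(x, local_radius=4, pool_size=4):
    """Illustrative only: x is (seq_len, dim); returns (seq_len, dim)."""
    seq_len, dim = x.shape
    # Coarse tokens: average-pool the whole sequence so distant context is
    # represented by fewer keys/values (this is where the cost saving comes from).
    coarse = F.avg_pool1d(x.t().unsqueeze(0), kernel_size=pool_size,
                          stride=pool_size).squeeze(0).t()        # (seq_len // pool_size, dim)
    out = torch.empty_like(x)
    for i in range(seq_len):
        # Fine tokens: the query's immediate neighbourhood at full resolution.
        lo, hi = max(0, i - local_radius), min(seq_len, i + local_radius + 1)
        fine = x[lo:hi]
        # Hybrid key/value set: fine local tokens + coarse global tokens.
        # (A faithful version would exclude the local window from the coarse set.)
        kv = torch.cat([fine, coarse], dim=0)
        q = x[i:i + 1]                                             # (1, dim)
        attn = torch.softmax(q @ kv.t() / dim ** 0.5, dim=-1)      # (1, n_keys)
        out[i] = (attn @ kv)[0]
    return out

tokens = torch.randn(64, 32)                                       # toy input
print(hybrid_scale_attention(tokens).shape)                        # torch.Size([64, 32])

With seq_len = 64, local_radius = 4 and pool_size = 4, each query attends to at most 9 fine keys plus 16 coarse keys instead of all 64 tokens, which is the kind of complexity reduction the abstract refers to.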

Authors

  • Ke Li
    School of Ideological and Political Education, Shanghai Maritime University, Shanghai, China.
  • Di Wang
    Center for Endocrine Metabolism and Immune Diseases, Beijing Luhe Hospital, Capital Medical University, Beijing, People's Republic of China.
  • Gang Liu
    Department of Interventional Radiology, Qinghai Red Cross Hospital, Xining, Qinghai, China.
  • Wenxuan Zhu
    Key Laboratory of Smart Human-Computer Interaction and Wearable Technology of Shaanxi Province, Xidian University, Xi'an, 710071, China.
  • Haodi Zhong
    Key Laboratory of Smart Human-Computer Interaction and Wearable Technology of Shaanxi Province, Xidian University, Xi'an, 710071, China.
  • Quan Wang
    Laboratory of Surgical Oncology, Peking University People's Hospital, Peking University, Beijing, China.