MSA-MaxNet: Multi-Scale Attention Enhanced Multi-Axis Vision Transformer Network for Medical Image Segmentation.
Journal:
Journal of Cellular and Molecular Medicine
Published Date:
Dec 1, 2024
Abstract
Convolutional neural networks (CNNs) are well established in handling local features in visual tasks, yet they struggle with the complex spatial relationships and long-range dependencies that are crucial for medical image segmentation, particularly in identifying pathological changes. While vision transformers (ViTs) excel at modelling long-range dependencies, their ability to leverage local features remains inadequate. Recent ViT variants have incorporated CNNs to improve feature representation and segmentation outcomes, yet limited receptive fields and imprecise feature representation remain challenges. In this work, we propose MSA-MaxNet. Specifically, our model adopts an encoder-decoder structure whose encoder uses MaxViT blocks with multi-axis self-attention (Max-SA) to extract both local and global features. To restore the spatial resolution of the feature maps, we employ a symmetric decoder built from MaxViT blocks and upsampling layers. To address the feature mismatches in the skip connections of the UNet architecture, we introduce the convolutional block attention module (CBAM). Furthermore, we design a multi-scale convolutional block attention module (MCBAM) based on CBAM, which utilises multi-scale features to enhance feature representation and refine the skip connections. We evaluate the segmentation performance of MSA-MaxNet on three publicly available medical imaging datasets: Synapse for multi-organ segmentation, ACDC for cardiac analysis and Kvasir-SEG for gastrointestinal polyp detection. Notably, MSA-MaxNet achieves state-of-the-art (SOTA) Dice scores of 85.59% and 95.26% on the Synapse and Kvasir-SEG datasets, respectively, with 40.28 M parameters. Additionally, we introduce two smaller versions of MSA-MaxNet to meet the demands of various scenarios.
In summary, our work provides a robust framework for diverse medical imaging tasks, offering potential applications in early cancer detection, cardiovascular disease diagnosis and comprehensive organ-level assessments.
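The Dice scores reported above measure the overlap between a predicted mask and the ground-truth mask. As a rough illustration only, the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), can be sketched as below. The function name `dice_score`, the flat 0/1 list representation, and the convention of returning 1.0 for two empty masks are assumptions made for this example; the abstract does not describe the authors' exact evaluation protocol (for instance, how scores are averaged across organs or cases).

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks (sequences of 0/1).

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    # Number of pixels labelled foreground in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    # Total number of foreground pixels across the two masks.
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks are empty: treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / total


# Toy example: 3 predicted pixels, 3 true pixels, 2 of them shared.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(f"Dice: {dice_score(pred, target):.4f}")
```

On this toy input the overlap is 2 pixels out of 3 + 3 foreground pixels, so the score is 2·2/6 ≈ 0.667. For a multi-class task such as Synapse, a score of this kind would typically be computed per class and then averaged, though the abstract does not state the exact procedure.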