Multi-scale fusion semantic enhancement network for medical image segmentation.
Journal:
Scientific Reports
Published Date:
Jul 2, 2025
Abstract
The application of advanced computer vision techniques to medical image segmentation (MIS) plays a vital role in clinical diagnosis and treatment. Although Transformer-based models are effective at capturing global context, they often struggle to model local feature dependencies. To address this limitation, we design a Multi-scale Fusion and Semantic Enhancement Network (MFSE-Net) for endoscopic image segmentation, which aims to capture global information while enhancing fine detail. MFSE-Net uses a dual-encoder architecture, with PVTv2 as the primary encoder to capture global features and a CNN-based secondary encoder to capture local details. The primary encoder includes the LGDA (Large-kernel Grouped Deformable Attention) module, which filters noise and enhances semantic extraction from the four hierarchical features. The secondary encoder uses the MLCF (Multi-Layered Cross-attention Fusion) module to integrate high-level semantic information from the deep CNN layers with fine spatial details from the shallow layers, improving boundary precision and localization. On the decoder side, we introduce the PSE (Parallel Semantic Enhancement) module, which embeds the boundary and position information from the secondary encoder into the output features of the backbone network. In the multi-scale decoding process, we also add SAM (Scale Aware Module) to recover global semantic information and compensate for the loss of boundary details. Extensive experiments show that MFSE-Net substantially outperforms state-of-the-art (SOTA) methods on renal tumor and polyp datasets.
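The dual-encoder layout described in the abstract can be illustrated with simple shape bookkeeping. This is a rough sketch, not the authors' implementation. The stage strides (4, 8, 16, 32) follow the standard PVTv2 design. The channel widths (those of the PVTv2-b2 variant), the 352x352 input size, the stride and width of the fused CNN feature, and all function names are assumptions made for illustration only. The LGDA, MLCF, PSE and SAM modules are treated as shape-preserving black boxes, since the abstract does not specify their internals.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)

# Strides of the four hierarchical stages in the standard PVTv2 design.
PVT_STRIDES = (4, 8, 16, 32)
# Assumption: channel widths of the PVTv2-b2 variant; the abstract names no variant.
PVT_CHANNELS = (64, 128, 320, 512)


def primary_encoder_shapes(height: int, width: int) -> List[Shape]:
    """Shapes of the four global feature maps from the PVTv2 branch.

    LGDA is assumed to refine each map without changing its shape.
    """
    return [(c, height // s, width // s)
            for c, s in zip(PVT_CHANNELS, PVT_STRIDES)]


def secondary_encoder_shape(height: int, width: int, channels: int = 64) -> Shape:
    """Shape of the single map produced by MLCF from deep and shallow CNN features.

    Assumption: the fused map sits at stride 4 with `channels` channels.
    """
    return (channels, height // 4, width // 4)


def mask_shape(height: int, width: int) -> Shape:
    """Assumption: the decoder emits a one-channel mask at input resolution."""
    return (1, height, width)


if __name__ == "__main__":
    h, w = 352, 352  # hypothetical input size
    for stage, shape in enumerate(primary_encoder_shapes(h, w), start=1):
        print(f"primary stage {stage}: {shape}")
    print("secondary (MLCF) feature:", secondary_encoder_shape(h, w))
    print("predicted mask:", mask_shape(h, w))
```

The point of the sketch is only that the primary branch yields a four-level feature pyramid while the secondary branch contributes a high-resolution feature carrying boundary and position cues, which PSE then injects into the decoder path.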