A Multi-Modal Fusion Framework for Brain Tumor Segmentation Based on 3D Spatial-Language-Vision Integration and Bidirectional Interactive Attention Mechanism
Journal: arXiv
Published Date: Jul 11, 2025
Abstract
This study aims to develop a novel multi-modal fusion framework for brain
tumor segmentation that integrates spatial-language-vision information through
bidirectional interactive attention mechanisms to improve segmentation accuracy
and boundary delineation. Methods: We propose two core components: a Multi-modal
Semantic Fusion Adapter (MSFA), which integrates 3D MRI data with clinical text
descriptions through hierarchical semantic decoupling, and Bidirectional
Interactive Visual-semantic Attention (BIVA), which enables iterative information
exchange between modalities. The framework was evaluated on the BraTS 2020 dataset,
comprising 369 multi-institutional MRI scans. Results: The proposed method
achieved an average Dice coefficient of 0.8505 and a 95% Hausdorff distance of
2.8256 mm across the enhancing tumor, tumor core, and whole tumor regions,
outperforming state-of-the-art methods including SCAU-Net, CA-Net, and 3D
U-Net. Ablation studies confirmed the critical contributions of the semantic and
spatial modules to boundary precision. Conclusion: Multi-modal semantic fusion
combined with bidirectional interactive attention significantly enhances brain
tumor segmentation performance, establishing new paradigms for integrating
clinical knowledge into medical image analysis.
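
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how a Dice coefficient and a 95% Hausdorff distance can be computed for a single binary region. It is not the authors' evaluation code. The function names (`dice`, `surface`, `percentile`, `hd95`), the representation of masks as sets of voxel coordinates, the 6-connected surface definition, and the pooling of both directed distance lists before taking the 95th percentile are all assumptions made for illustration; published implementations differ in these details.

```python
import math
from itertools import product


def dice(pred, truth):
    """Dice coefficient between two sets of (x, y, z) voxel coordinates."""
    if not pred and not truth:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def surface(mask):
    """Voxels of mask with at least one 6-connected neighbour outside it."""
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return {
        v for v in mask
        if any((v[0] + dx, v[1] + dy, v[2] + dz) not in mask for dx, dy, dz in offsets)
    }


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def hd95(pred, truth, spacing=(1.0, 1.0, 1.0)):
    """95th-percentile symmetric surface distance, in the units of spacing."""
    sp, st = surface(pred), surface(truth)
    if not sp or not st:
        raise ValueError("hd95 is undefined when either mask is empty")

    def dist(a, b):
        # Euclidean distance scaled by the physical voxel spacing
        return math.sqrt(sum(((x - y) * s) ** 2 for x, y, s in zip(a, b, spacing)))

    # Brute-force nearest-surface search; adequate only for small toy masks.
    d_pred_to_truth = [min(dist(p, t) for t in st) for p in sp]
    d_truth_to_pred = [min(dist(t, p) for p in sp) for t in st]
    return percentile(d_pred_to_truth + d_truth_to_pred, 95)


# Toy example: a 4x4x4 cube and the same cube shifted by one voxel along x.
truth = set(product(range(4), repeat=3))
pred = {(x + 1, y, z) for x, y, z in truth}
print(dice(pred, truth))  # 0.75
print(hd95(pred, truth))  # 1.0
```

An average such as the one quoted in the abstract would then be the mean of these per-region values over the enhancing tumor, tumor core, and whole tumor masks. For full-size volumes, a distance-transform-based implementation from an established library is a more practical choice than the brute-force search used here.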