TK-Mamba: Marrying KAN with Mamba for Text-Driven 3D Medical Image Segmentation
Journal: arXiv
Published Date: May 24, 2025
Abstract
3D medical image segmentation is vital for clinical diagnosis and treatment
but is challenged by high-dimensional data and complex spatial dependencies.
Traditional single-modality networks, such as CNNs and Transformers, are often
limited by computational inefficiency and constrained contextual modeling in 3D
settings. We introduce a novel multimodal framework that leverages Mamba and
Kolmogorov-Arnold Networks (KAN) as an efficient backbone for long-sequence
modeling. Our approach features three key innovations. First, an Enhanced Gated
Spatial Convolution (EGSC) module captures spatial information when unfolding
3D images into 1D sequences. Second, we extend Group-Rational KAN (GR-KAN), a
KAN variant with rational basis functions, into 3D-Group-Rational KAN
(3D-GR-KAN) for 3D medical imaging (its first application in this domain),
enabling feature representations tailored
to volumetric data. Third, a dual-branch text-driven strategy leverages CLIP's
text embeddings: one branch swaps one-hot labels for semantic vectors to
preserve inter-organ semantic relationships, while the other matches images to
detailed organ descriptions to strengthen semantic alignment. Experiments on the
Medical Segmentation Decathlon (MSD) and KiTS23 datasets show that our method
achieves state-of-the-art performance, surpassing existing approaches in both
accuracy and efficiency. This work highlights the power of combining advanced
sequence modeling, extended network architectures, and vision-language synergy
to push forward 3D medical image segmentation, delivering a scalable solution
for clinical use. The source code is openly available at
https://github.com/yhy-whu/TK-Mamba.
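
Two of the motivations in the abstract can be illustrated with a small,
self-contained sketch. It is not taken from the TK-Mamba code base. The first
part assumes a plain row-major unfolding of the volume and shows how far apart
spatial neighbours land in the 1D sequence, which is the kind of information a
spatial module such as EGSC is meant to recover. The second part compares
one-hot labels with semantic vectors. The helper names (`flat_index`, `cosine`)
and the three-dimensional "text" vectors are hypothetical, made-up values used
only for illustration; they are not real CLIP embeddings.

```python
import math


def flat_index(z, y, x, height, width):
    """Position of voxel (z, y, x) in a row-major 1D unfolding (assumed order)."""
    return (z * height + y) * width + x


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Part 1: unfolding a small volume of height 8 and width 8.
H, W = 8, 8
centre = flat_index(1, 3, 3, H, W)
gap_x = flat_index(1, 3, 4, H, W) - centre  # neighbour along x
gap_y = flat_index(1, 4, 3, H, W) - centre  # neighbour along y
gap_z = flat_index(2, 3, 3, H, W) - centre  # neighbour along z
print("sequence distance to x/y/z neighbours:", gap_x, gap_y, gap_z)

# Part 2: one-hot labels versus semantic vectors.
one_hot = {
    "liver": [1, 0, 0],
    "liver tumor": [0, 1, 0],
    "pancreas": [0, 0, 1],
}
# Hypothetical toy vectors standing in for text embeddings (NOT real CLIP output).
toy_text = {
    "liver": [0.9, 0.1, 0.2],
    "liver tumor": [0.8, 0.3, 0.2],
    "pancreas": [0.1, 0.2, 0.9],
}

one_hot_related = cosine(one_hot["liver"], one_hot["liver tumor"])
one_hot_unrelated = cosine(one_hot["liver"], one_hot["pancreas"])
text_related = cosine(toy_text["liver"], toy_text["liver tumor"])
text_unrelated = cosine(toy_text["liver"], toy_text["pancreas"])

print("one-hot similarities:", one_hot_related, one_hot_unrelated)
print("toy text similarities:", round(text_related, 3), round(text_unrelated, 3))
```

Under this assumed unfolding, a voxel's neighbour along x stays adjacent in the
sequence, while its neighbours along y and z sit a full row and a full slice
away. In the second part, every pair of one-hot labels has zero similarity, so
the encoding cannot express that a liver tumor is more closely related to the
liver than to the pancreas, whereas dense vectors can carry that relationship.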