Zeus: Zero-shot LLM Instruction for Union Segmentation in Multimodal Medical Imaging
Journal:
arXiv
Published Date:
Apr 9, 2025
Abstract
Medical image segmentation has achieved remarkable success through the
continuous advancement of UNet-based and Transformer-based foundation
backbones. However, clinical diagnosis in the real world often requires
integrating domain knowledge, especially textual information. Multimodal
learning over visual and text modalities has been shown to be a solution,
but collecting paired vision-language datasets is expensive and time-consuming,
posing significant challenges. Inspired by the strong performance of Large
Language Models (LLMs) on numerous cross-modal tasks, we propose a novel
Vision-LLM union framework to address these issues. Specifically, we introduce
frozen LLMs for zero-shot instruction generation based on corresponding medical
images, imitating the radiology scanning and report generation process. To
better approximate real-world diagnostic processes, we generate more precise
text instructions from multimodal radiology images (e.g., T1-w or T2-w MRI and
CT). Building on the semantic understanding and rich knowledge of LLMs, this
process emphasizes extracting specific features from different modalities and
recombining the information for the final clinical diagnosis. With the
generated text instructions, our proposed union segmentation
framework can handle multimodal segmentation without previously collected
vision-language datasets. To evaluate our proposed method, we conduct
comprehensive experiments against influential baselines; the statistical results
and the visualized case study demonstrate the superiority of our method.