VIGIL: Vision-Language Guided Multiple Instance Learning Framework for Ulcerative Colitis Histological Healing Prediction
Journal:
arXiv
Published Date:
May 14, 2025
Abstract
Objective: Ulcerative colitis (UC), characterized by chronic inflammation
with alternating remission-relapse cycles, requires precise histological
healing (HH) evaluation to improve clinical outcomes. To overcome the
limitations of annotation-intensive deep learning methods and suboptimal
multi-instance learning (MIL) in HH prediction, we propose VIGIL, the first
vision-language guided MIL framework integrating white light endoscopy (WLE)
and endocytoscopy (EC). Methods: VIGIL begins with a dual-branch MIL module,
KS-MIL, which combines top-K typical frame selection with adaptive similarity
metric learning to model relationships among frame features effectively. By
integrating the diagnostic report text and specially designed multi-level
alignment and supervision between image-text pairs, VIGIL establishes joint
image-text guidance during training to capture richer disease-related semantic
information. Furthermore, VIGIL employs a multi-modal masked relation fusion
(MMRF) strategy to uncover the latent diagnostic correlations between the two endoscopic
image representations. Results: Comprehensive experiments on a real-world
clinical dataset demonstrate VIGIL's superior performance, achieving 92.69%
accuracy and 94.79% AUC, outperforming existing state-of-the-art methods.
Conclusion: The proposed VIGIL framework successfully establishes an effective
vision-language guided MIL paradigm for UC HH prediction, reducing annotation
burdens while improving prediction reliability. Significance: The research
outcomes provide new insights for non-invasive UC diagnosis and hold
theoretical significance and clinical value for advancing intelligent
healthcare development.
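
The abstract names top-K typical frame selection and similarity-based learning among frame features but does not specify how KS-MIL implements them. The following is only a minimal illustrative sketch of the general idea, not the authors' method: it assumes each frame already has a feature vector and a scalar "typicality" score, keeps the K highest-scoring frames, and pools them with weights given by their cosine similarity to the top frame. The function names (top_k_indices, cosine_similarity, aggregate_bag) and the fixed cosine metric are assumptions made for illustration; the paper describes an adaptively learned metric.

```python
import math


def top_k_indices(scores, k):
    """Return the indices of the k highest-scoring frames, best first."""
    k = min(k, len(scores))
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def aggregate_bag(features, scores, k):
    """Pool a bag of frame features into a single bag-level vector.

    Hypothetical pooling rule (an assumption, not taken from the paper):
    select the top-k frames by score, then average them with weights equal
    to their non-negative cosine similarity to the best-scoring frame.
    """
    if not features or k < 1:
        raise ValueError("need a non-empty bag and k >= 1")
    idx = top_k_indices(scores, k)
    anchor = features[idx[0]]
    weights = [max(cosine_similarity(features[i], anchor), 0.0) for i in idx]
    total = sum(weights)
    if total == 0.0:
        # Fall back to a plain mean when every similarity weight is zero.
        weights = [1.0] * len(idx)
        total = float(len(idx))
    dim = len(anchor)
    return [
        sum(w * features[i][d] for w, i in zip(weights, idx)) / total
        for d in range(dim)
    ]


# Toy bag of four 2-D frame features with per-frame scores.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
frame_scores = [0.9, 0.1, 0.8, 0.3]
bag_vector = aggregate_bag(frames, frame_scores, k=2)
```

In this toy bag, frames 0 and 2 are selected and pooled. In a trained model the scores and the similarity function would both be learned end to end rather than fixed as they are here.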