HiLa: Hierarchical Vision-Language Collaboration for Cancer Survival Prediction
Journal: arXiv
Published Date: Jul 7, 2025
Abstract
Survival prediction using whole-slide images (WSIs) is crucial in cancer
research. Despite notable success, existing approaches are limited by their
reliance on sparse slide-level labels, which hinders the learning of
discriminative representations from gigapixel WSIs. Recently, vision-language
(VL) models, which incorporate additional language supervision, have emerged as
a promising solution. However, VL-based survival prediction remains largely
unexplored due to two key challenges. First, current methods often rely on only
one simple language prompt and basic cosine similarity, which fails to learn
fine-grained associations between multi-faceted linguistic information and
visual features within WSIs, resulting in inadequate vision-language alignment.
Second, these methods primarily exploit patch-level information, overlooking
the intrinsic hierarchy of WSIs and the interactions between its levels, which
leads to ineffective hierarchical modeling. To tackle these problems, we propose a
novel Hierarchical vision-Language collaboration (HiLa) framework for improved
survival prediction. Specifically, HiLa employs pretrained feature extractors
to generate hierarchical visual features from WSIs at both patch and region
levels. At each level, a series of language prompts describing various
survival-related attributes are constructed and aligned with visual features
via Optimal Prompt Learning (OPL). This approach enables the comprehensive
learning of discriminative visual features corresponding to different
survival-related attributes from prompts, thereby improving vision-language
alignment. Furthermore, we introduce two modules, i.e., Cross-Level Propagation
(CLP) and Mutual Contrastive Learning (MCL), to maximize hierarchical
cooperation by promoting interactions and consistency between patch and region
levels. Experiments on three TCGA datasets show that HiLa achieves state-of-the-art (SOTA) performance.
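
To make the alignment problem described in the abstract concrete, the following is a minimal sketch of the baseline it criticizes: scoring visual features against a single prompt with plain cosine similarity, next to the per-attribute similarity matrix that becomes available once several prompts are used. It is not the authors' OPL, CLP, or MCL. The function names, the mean pooling, and the two-dimensional toy vectors are assumptions made purely for illustration; real features would come from pretrained vision and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def single_prompt_score(visual_feats, prompt_feat):
    """Baseline: mean cosine similarity of all visual tokens to one prompt.

    Mean pooling is an assumption of this sketch, not something the
    abstract specifies.
    """
    sims = [cosine(v, prompt_feat) for v in visual_feats]
    return sum(sims) / len(sims)


def similarity_matrix(visual_feats, prompt_feats):
    """Rows are visual tokens (patches or regions); columns are prompts.

    With several attribute prompts, each token gets one similarity per
    attribute instead of a single scalar.
    """
    return [[cosine(v, p) for p in prompt_feats] for v in visual_feats]


# Toy stand-ins for encoder outputs (hypothetical values).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompts = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical attribute prompts

score = single_prompt_score(patches, prompts[0])
sim = similarity_matrix(patches, prompts)
```

In this toy case the single-prompt score collapses three quite different patches into one number, while the matrix keeps the information that the first patch matches only the first prompt and the third patch matches both equally. The same matrix could be computed separately for patch-level and region-level features.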