Cross-Modal Prototype Allocation: Unsupervised Slide Representation Learning via Patch-Text Contrast in Computational Pathology
Journal: arXiv
Published Date: Mar 26, 2025
Abstract
With the rapid advancement of pathology foundation models (FMs),
representation learning for whole slide images (WSIs) has attracted increasing
attention. Existing studies develop high-quality patch feature extractors and
employ carefully designed aggregation schemes to derive slide-level
representations. However, mainstream weakly supervised slide representation
learning methods, primarily based on multiple instance learning (MIL), are
tailored to specific downstream tasks, which limits their generalizability. To
address this issue, some studies explore unsupervised slide representation
learning. However, these approaches focus solely on the visual modality of
patches, neglecting the rich semantic information embedded in textual data. In
this work, we propose ProAlign, a cross-modal unsupervised slide representation
learning framework. Specifically, we leverage a large language model (LLM) to
generate descriptive text for the prototype types present in a WSI, introducing
patch-text contrast to construct initial prototype embeddings. Furthermore, we
propose a parameter-free attention aggregation strategy that utilizes the
similarity between patches and these prototypes to form unsupervised slide
embeddings applicable to a wide range of downstream tasks. Extensive
experiments on four public datasets show that ProAlign outperforms existing
unsupervised frameworks and achieves performance comparable to some weakly
supervised models.
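
The two steps described above (building initial prototypes from patch-text contrast, then pooling patches with a parameter-free attention) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the assignment rule (each patch goes to its most similar text description), the mean-based prototype update, the softmax pooling, the `temperature` value, and all function names here are assumptions made for illustration. It also assumes patch and text embeddings already live in a shared space, as a vision-language pathology model would provide; random vectors stand in for real features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def init_prototypes(patches, text_emb):
    """Assumed patch-text contrast: each patch selects its most similar text
    description; a prototype is the mean of the patches that selected it,
    falling back to the text embedding when no patch did."""
    p, t = l2_normalize(patches), l2_normalize(text_emb)
    assign = (p @ t.T).argmax(axis=1)          # (N,) best-matching text index
    protos = t.copy()
    for k in range(t.shape[0]):
        members = p[assign == k]
        if len(members) > 0:
            protos[k] = members.mean(axis=0)
    return l2_normalize(protos)


def slide_embedding(patches, protos, temperature=0.1):
    """Assumed parameter-free attention: for each prototype, a softmax over
    patches of the patch-prototype similarity weights a pooled vector.
    The temperature is a fixed hyperparameter, not a learned weight."""
    p = l2_normalize(patches)
    attn = softmax((p @ protos.T) / temperature, axis=0)   # (N, K)
    pooled = attn.T @ p                                    # (K, d)
    return pooled.reshape(-1)                              # (K * d,)


# Hypothetical stand-ins: 200 patch features and 4 text descriptions, d = 16.
rng = np.random.default_rng(0)
patches = rng.normal(size=(200, 16))
text_emb = rng.normal(size=(4, 16))

protos = init_prototypes(patches, text_emb)
emb = slide_embedding(patches, protos)
```

Because nothing in this sketch is trained, the resulting slide vector is task-agnostic and could be passed to any downstream classifier or retrieval index, which is the property the abstract emphasizes.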