Multi-Modal Foundation Models for Computational Pathology: A Survey
Journal: arXiv
Published Date: Mar 12, 2025
Abstract
Foundation models have emerged as a powerful paradigm in computational
pathology (CPath), enabling scalable and generalizable analysis of
histopathological images. While early developments centered on uni-modal models
trained solely on visual data, recent advances have highlighted the promise of
multi-modal foundation models that integrate heterogeneous data sources such as
textual reports, structured domain knowledge, and molecular profiles. In this
survey, we provide a comprehensive and up-to-date review of multi-modal
foundation models in CPath, with a particular focus on models built upon
hematoxylin and eosin (H&E)-stained whole slide images (WSIs) and tile-level
representations. We categorize 32 state-of-the-art multi-modal foundation
models into three major paradigms: vision-language, vision-knowledge graph, and
vision-gene expression. We further divide vision-language models into
non-LLM-based and LLM-based approaches. Additionally, we analyze 28 available
multi-modal datasets tailored for pathology, grouped into image-text pairs,
instruction datasets, and image-other modality pairs. Our survey also presents
a taxonomy of downstream tasks, highlights training and evaluation strategies,
and identifies key challenges and future directions. We aim for this survey to
serve as a valuable resource for researchers and practitioners working at the
intersection of pathology and AI.