CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology
Journal: arXiv
Published Date: Dec 16, 2024
Abstract
The emergence of large multimodal models (LMMs) has brought significant
advancements to pathology. Previous research has primarily focused on
separately training patch-level and whole-slide image (WSI)-level models,
limiting the integration of learned knowledge across patches and WSIs, and
resulting in redundant models. In this work, we introduce CPath-Omni, the first
15-billion-parameter LMM designed to unify patch-level and WSI-level image
analysis, consolidating a variety of tasks at both levels, including
classification, visual question answering, captioning, and visual referring
prompting. Extensive experiments demonstrate that CPath-Omni achieves
state-of-the-art (SOTA) performance across seven diverse tasks on 39 out of 42
datasets, outperforming or matching task-specific models trained for individual
tasks. Additionally, we develop CPath-CLIP, a specialized pathology CLIP-based
visual processor for CPath-Omni. For the first time, CPath-CLIP integrates
different vision models and incorporates a large language model as a text
encoder to build a more powerful CLIP model, achieving SOTA performance on
nine zero-shot and four few-shot datasets. Our findings highlight CPath-Omni's
ability to unify diverse pathology tasks, demonstrating its potential to
streamline and advance the field of foundation models in pathology.