VideoPath-LLaVA: Pathology Diagnostic Reasoning Through Video Instruction Tuning
Journal: arXiv
Published Date: May 7, 2025
Abstract
We present VideoPath-LLaVA, the first large multimodal model (LMM) in
computational pathology that integrates three distinct image scenarios (single
patch images, automatically keyframe-extracted clips, and manually segmented
video pathology images) to mimic the natural diagnostic process of
pathologists. By generating detailed histological descriptions that culminate
in a definitive sign-out diagnosis, VideoPath-LLaVA bridges visual narratives
with diagnostic reasoning.
Central to our approach is the VideoPath-Instruct dataset, comprising 4,278
video and diagnosis-specific chain-of-thought instructional pairs sourced from
educational histopathology videos on YouTube. Although high-quality data is
critical for enhancing diagnostic reasoning, such data is time-intensive to
create and therefore limited in volume. To overcome this challenge, we transfer
knowledge from
existing single-image instruction datasets to train on weakly annotated,
keyframe-extracted clips, followed by fine-tuning on manually segmented videos.
VideoPath-LLaVA establishes a new benchmark in pathology video analysis and
offers a promising foundation for future AI systems that support clinical
decision-making through integrated visual and diagnostic reasoning. Our code,
data, and model are publicly available at
https://github.com/trinhvg/VideoPath-LLaVA.
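
The staged training described above can be pictured as an ordered curriculum.
The following is only a minimal sketch under stated assumptions, not the
released implementation: the names Stage, CURRICULUM, run_curriculum, and
train_one_stage are hypothetical, and treating the single-image knowledge
transfer as its own stage is an interpretation of the abstract rather than a
confirmed detail.

```python
from dataclasses import dataclass
from typing import Callable, List, TypeVar

State = TypeVar("State")


@dataclass(frozen=True)
class Stage:
    """One step of the curriculum (all field values are illustrative labels)."""
    name: str         # hypothetical identifier for the stage
    data: str         # kind of visual input, as named in the abstract
    supervision: str  # annotation source, as named in the abstract


# Order follows the abstract: single-image knowledge first, then weakly
# annotated clips, then manually segmented videos.
CURRICULUM: List[Stage] = [
    Stage("single_image", "single patch images",
          "existing single-image instruction datasets"),
    Stage("keyframe_clips", "automatically keyframe-extracted clips",
          "weak annotations"),
    Stage("segmented_videos", "manually segmented videos",
          "manual segmentation"),
]


def run_curriculum(
    stages: List[Stage],
    train_one_stage: Callable[[State, Stage], State],
    init_state: State,
) -> State:
    """Train stage by stage, passing each stage's result on to the next."""
    state = init_state
    for stage in stages:
        state = train_one_stage(state, stage)
    return state


if __name__ == "__main__":
    # Stand-in trainer that only records which stages were visited.
    visited = run_curriculum(CURRICULUM, lambda s, st: s + [st.name], [])
    print(visited)
```

The point of the sketch is the ordering: each stage starts from the state left
by the previous one, so the scarce, manually segmented videos are used last,
after the more plentiful data has been seen.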