VideoPath-LLaVA: Pathology Diagnostic Reasoning Through Video Instruction Tuning

Journal: arXiv

Published Date: May 7, 2025

Abstract

We present VideoPath-LLaVA, the first large multimodal model (LMM) in computational pathology that integrates three distinct image scenarios (single patch images, automatically keyframe-extracted clips, and manually segmented video pathology images) to mimic the natural diagnostic process of pathologists. By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, VideoPath-LLaVA bridges visual narratives with diagnostic reasoning. Central to our approach is the VideoPath-Instruct dataset, comprising 4,278 video and diagnosis-specific chain-of-thought instructional pairs sourced from educational histopathology videos on YouTube. Although high-quality data is critical for enhancing diagnostic reasoning, its creation is time-intensive and limited in volume. To overcome this challenge, we transfer knowledge from existing single-image instruction datasets to train on weakly annotated, keyframe-extracted clips, followed by fine-tuning on manually segmented videos. VideoPath-LLaVA establishes a new benchmark in pathology video analysis and offers a promising foundation for future AI systems that support clinical decision-making through integrated visual and diagnostic reasoning. Our code, data, and model are publicly available at https://github.com/trinhvg/VideoPath-LLaVA.

Authors

Trinh T. L. Vuong
Jin Tae Kwak

External Resources

View on arXiv (http://arxiv.org/abs/2505.04192v1)
