Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner
Journal:
arXiv
Published Date:
May 16, 2025
Abstract
Recent advances in vision-language models (VLMs) have enabled broad progress
in the general medical field. However, pathology remains a more
challenging subdomain, with current pathology-specific VLMs exhibiting
limitations in both diagnostic accuracy and reasoning plausibility. Such
shortcomings are largely attributable to the nature of current pathology
datasets, which are primarily composed of image-description pairs that lack the
depth and structured diagnostic paradigms employed by real-world pathologists.
In this study, we leverage pathology textbooks and real-world pathology experts
to construct high-quality, reasoning-oriented datasets. Building on this, we
introduce Patho-R1, a multimodal RL-based pathology Reasoner, trained through a
three-stage pipeline: (1) continued pretraining on 3.5 million image-text pairs
for knowledge infusion; (2) supervised fine-tuning on 500k high-quality
Chain-of-Thought samples to incentivize reasoning; and (3) reinforcement
learning with Group Relative Policy Optimization (GRPO) and Decoupled Clip and
Dynamic sAmpling Policy Optimization (DAPO) strategies to refine multimodal
reasoning quality. To further assess the alignment quality of our dataset, we
propose PathoCLIP, trained on the same figure-caption corpus used for continued
pretraining. Comprehensive experimental results demonstrate that both PathoCLIP
and Patho-R1 achieve robust performance across a wide range of
pathology-related tasks, including zero-shot classification, cross-modal
retrieval, visual question answering, and multiple-choice question answering. Our project
is available at the Patho-R1 repository:
https://github.com/Wenchuan-Zhang/Patho-R1.
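
The abstract names Group Relative Policy Optimization (GRPO) for the reinforcement learning stage but does not describe the reward design or normalization details. As a rough illustration only, the sketch below shows the group-relative advantage commonly associated with GRPO: several responses are sampled for the same prompt, and each response's reward is standardized against the mean and standard deviation of its own group. The function name, the example rewards, the `eps` value, and the use of the population standard deviation are assumptions made for this example and are not taken from the Patho-R1 code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own sampling group.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (assumed value) that avoids division by zero
         when every response in the group receives the same reward.
    """
    if not rewards:
        return []
    group_mean = mean(rewards)
    # Population standard deviation is an assumption of this sketch;
    # some implementations use the sample standard deviation instead.
    group_std = pstdev(rewards)
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Hypothetical group of four sampled answers to one question:
# 1.0 marks a correct answer, 0.0 an incorrect one.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group mean receive a positive advantage and those below it a negative one, so no separate value network is needed to provide a baseline. When all responses in a group share the same reward, every advantage is zero and the group contributes no gradient signal, which is the situation that the dynamic sampling component of DAPO is meant to avoid.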