PiPViT: Patch-based Visual Interpretable Prototypes for Retinal Image Analysis
Journal: arXiv
Published Date: Jun 12, 2025
Abstract
Background and Objective: Prototype-based methods improve interpretability by
learning fine-grained part-prototypes; however, their visualization in the
input pixel space is not always consistent with human-understandable
biomarkers. In addition, well-known prototype-based approaches typically learn
extremely granular prototypes that are less interpretable in medical imaging,
where both the presence and extent of biomarkers and lesions are critical.
Methods: To address these challenges, we propose PiPViT (Patch-based Visual
Interpretable Prototypes), an inherently interpretable prototypical model for
image recognition. Leveraging a vision transformer (ViT), PiPViT captures
long-range dependencies among patches to learn robust, human-interpretable
prototypes that approximate lesion extent using only image-level labels.
Additionally, PiPViT benefits from contrastive learning and multi-resolution
input processing, which together enable effective localization of biomarkers
across scales.
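The abstract does not specify how PiPViT turns ViT patch tokens into prototype scores. As a rough illustration only, the sketch below assumes a PIP-Net-style design. Each patch embedding is softly assigned to prototypes via a softmax over the prototype axis. A prototype's presence is its strongest patch activation across all input resolutions. A non-negative linear layer maps presence to class logits, and the set of patches above a threshold serves as a crude extent map. The function names, `threshold`, and the max-over-scales fusion are hypothetical. Patch features are assumed to be precomputed, and contrastive training is not shown. This is not the authors' implementation.

```python
import numpy as np


def patch_assignments(patch_features, prototypes):
    """Hypothetical soft assignment of patches to prototypes.

    patch_features: (H, W, D) ViT patch embeddings (assumed precomputed).
    prototypes:     (P, D) prototype vectors (hypothetical parameterization).
    Returns (H, W, P) with a softmax taken over the prototype axis.
    """
    logits = patch_features @ prototypes.T
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    assign = np.exp(logits)
    return assign / assign.sum(axis=-1, keepdims=True)


def classify(feature_maps, prototypes, class_weights, threshold=0.5):
    """Score one image given patch grids at several input resolutions (assumed).

    feature_maps:  list of (H_s, W_s, D) arrays, one per scale.
    class_weights: (P, C) weights; clipped to be non-negative before use.
    """
    assigns = [patch_assignments(f, prototypes) for f in feature_maps]
    # Presence of each prototype: max activation over patches and scales -> (P,)
    presence = np.max([a.max(axis=(0, 1)) for a in assigns], axis=0)
    # Non-negative linear layer from prototype presence to class logits -> (C,)
    class_logits = presence @ np.clip(class_weights, 0.0, None)
    # Crude extent maps: all patches above the threshold, per scale -> (H_s, W_s, P)
    extents = [a >= threshold for a in assigns]
    return presence, class_logits, extents


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    protos = rng.normal(size=(8, 64))
    maps = [rng.normal(size=(14, 14, 64)), rng.normal(size=(28, 28, 64))]
    w = rng.uniform(size=(8, 4))
    presence, logits, extents = classify(maps, protos, w)
    print(presence.shape, logits.shape, [e.shape for e in extents])
    # (8,) (4,) [(14, 14, 8), (28, 28, 8)]
```

Thresholding the assignment map, rather than keeping only the single best-matching patch, is one simple way to let a prototype cover a region. That is closer in spirit to the lesion-extent goal described above than a single-patch explanation.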
Results: We evaluated PiPViT on retinal OCT image classification across four
datasets, where it achieved competitive quantitative performance compared to
state-of-the-art methods while delivering more meaningful explanations.
Moreover, quantitative evaluation on a held-out test set confirms that the
learned prototypes are semantically and clinically relevant. We believe PiPViT
can transparently explain its decisions and assist clinicians in understanding
diagnostic outcomes. Github page: https://github.com/marziehoghbaie/PiPViT