Fine-tuning Vision Language Models with Graph-based Knowledge for Explainable Medical Image Analysis
Journal: arXiv
Published Date: Mar 12, 2025
Abstract
Accurate staging of Diabetic Retinopathy (DR) is essential for guiding timely
interventions and preventing vision loss. However, current staging models offer
little interpretability, and most public datasets contain no clinical reasoning or
interpretation beyond image-level labels. In this paper, we present a novel
method that integrates graph representation learning with vision-language
models (VLMs) to deliver explainable DR diagnosis. Our approach leverages
optical coherence tomography angiography (OCTA) images by constructing
biologically informed graphs that encode key retinal vascular features such as
vessel morphology and spatial connectivity. A graph neural network (GNN) then
performs DR staging, while integrated gradients highlight the critical nodes and
edges, along with their individual features, that drive the classification
decisions. We collect this graph-based knowledge, which attributes the model's
predictions to physiological structures and their characteristics. We then
transform it into
textual descriptions for VLMs. We perform instruction-tuning with these textual
descriptions and the corresponding image to train a student VLM. This final
agent can classify the disease and explain its decision in a
human-interpretable way based solely on a single input image. Experimental
evaluations on both proprietary and public datasets demonstrate that our method
not only improves classification accuracy but also offers more clinically
interpretable results. An expert study further demonstrates that our method
provides more accurate diagnostic explanations and paves the way for precise
localization of pathologies in OCTA images.
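
The pipeline described above (attribute a graph model's prediction to node features, then turn the attributions into text for instruction-tuning) can be illustrated with a minimal sketch. It is not the authors' implementation: the "GNN" here is a toy stand-in (mean pooling plus a logistic unit), and the feature names, weights, image path, stage label, and instruction wording are all hypothetical placeholders. Only the integrated-gradients step follows the standard definition, approximated with a midpoint Riemann sum along a straight path from a baseline to the input.

```python
import math

# Hypothetical per-segment features; the names are illustrative only.
FEATURES = ["radius", "length", "tortuosity"]


def model(nodes, weights, bias):
    """Toy stand-in for a GNN: mean-pool node features, then apply a logistic unit."""
    pooled = [sum(col) / len(nodes) for col in zip(*nodes)]
    z = sum(w * p for w, p in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def gradient(nodes, weights, bias):
    """Analytic gradient of the toy model with respect to every node feature."""
    p = model(nodes, weights, bias)
    scale = p * (1.0 - p) / len(nodes)
    return [[scale * w for w in weights] for _ in nodes]


def integrated_gradients(nodes, baseline, weights, bias, steps=50):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    total = [[0.0] * len(weights) for _ in nodes]
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Interpolate between the baseline and the actual input.
        point = [[b + alpha * (x - b) for x, b in zip(n_row, b_row)]
                 for n_row, b_row in zip(nodes, baseline)]
        grad = gradient(point, weights, bias)
        for i, row in enumerate(grad):
            for j, g in enumerate(row):
                total[i][j] += g / steps
    # Scale the path-averaged gradient by the input's offset from the baseline.
    return [[(x - b) * t for x, b, t in zip(n_row, b_row, t_row)]
            for n_row, b_row, t_row in zip(nodes, baseline, total)]


def describe(attributions, top_k=2):
    """Turn the largest-magnitude attributions into short sentences."""
    ranked = sorted(
        ((abs(a), i, j) for i, row in enumerate(attributions) for j, a in enumerate(row)),
        reverse=True,
    )
    return [f"Vessel segment {i}: {FEATURES[j]} (attribution {attributions[i][j]:+.3f})"
            for _, i, j in ranked[:top_k]]


def to_instruction_sample(image_path, label, sentences):
    """Pack image, prompt, and explanation into one (hypothetical) training record."""
    return {
        "image": image_path,
        "instruction": "Stage the diabetic retinopathy in this OCTA image and explain why.",
        "response": f"Stage: {label}. Key findings: " + "; ".join(sentences) + ".",
    }


# Two made-up vessel segments with three features each, and an all-zero baseline.
nodes = [[1.0, 2.0, 0.5], [3.0, 1.0, 2.5]]
baseline = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
weights, bias = [0.2, -0.1, 0.8], -0.5

attributions = integrated_gradients(nodes, baseline, weights, bias)
sample = to_instruction_sample("octa_0001.png", "example stage", describe(attributions))
print(sample["response"])
```

A useful sanity check for any integrated-gradients implementation is the completeness property: the attributions should sum, up to numerical error, to the difference between the model output at the input and at the baseline. In a real setting the analytic gradient would be replaced by automatic differentiation through the trained GNN, and attributions would cover edges as well as nodes.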