Adapting Lightweight Vision Language Models for Radiological Visual Question Answering
Journal: arXiv
Published Date: Jun 17, 2025
Abstract
Recent advancements in vision-language systems have improved the accuracy of
Radiological Visual Question Answering (VQA) models. However, challenges
remain at each stage of model development: the scarcity of expert-labeled images
hinders data procurement at scale; the intricate and nuanced patterns of
radiological images make modeling inherently difficult; and the lack of
evaluation efforts makes it difficult to identify cases where the
model might be ill-conditioned. In this study, we fine-tune a lightweight
3B-parameter vision-language model for Radiological VQA, demonstrating that small
models, when appropriately tuned with curated data, can achieve robust
performance across both open- and closed-ended questions. We propose a
cost-effective training pipeline that spans synthetic question-answer pair
generation and multi-stage fine-tuning on specialized radiology datasets
(e.g., ROCO v2.0, MedPix v2.0). Our results show that despite operating at a
fraction of the scale of state-of-the-art models such as LLaVA-Med, our model
achieves promising performance given its small parameter size and the limited
scale of training data. We also introduce a lightweight saliency-based diagnostic
tool that enables domain experts to inspect VQA model performance and identify
ill-conditioned failure modes.