MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering
Journal:
arXiv
Published Date:
Jun 28, 2025
Abstract
Medical visual question answering (MedVQA) plays a vital role in clinical
decision-making by providing contextually rich answers to image-based queries.
Although vision-language models (VLMs) are widely used for this task, they
often generate factually incorrect answers. Retrieval-augmented generation
addresses this challenge by providing information from external sources, but
risks retrieving irrelevant context, which can degrade the reasoning
capabilities of VLMs. Re-ranking retrievals, as introduced in existing
approaches, enhances retrieval relevance by focusing on query-text alignment.
However, these approaches neglect the visual or multimodal context, which is
particularly crucial for medical diagnosis. We propose MOTOR, a novel
multimodal retrieval and re-ranking approach that leverages grounded captions
and optimal transport. It captures the underlying relationships between the
query and the retrieved context based on textual and visual information.
Consequently, our approach identifies more clinically relevant contexts to
augment the VLM input. Empirical analysis and human expert evaluation
demonstrate that MOTOR achieves higher accuracy on MedVQA datasets,
outperforming state-of-the-art methods by an average of 6.45%. Code is
available at https://github.com/BioMedIA-MBZUAI/MOTOR.