Guiding Medical Vision-Language Models with Explicit Visual Prompts: Framework Design and Comprehensive Exploration of Prompt Variations
Journal: arXiv
Published Date: Jan 4, 2025
Abstract
While mainstream vision-language models (VLMs) have advanced rapidly in
understanding image-level information, they still lack the ability to focus on
specific areas designated by humans. Rather, they typically rely on large
volumes of high-quality image-text paired data to learn and generate posterior
attention maps. To address this critical issue, we propose leveraging visual
prompts (simple visual markers in various forms) to guide and enhance the
formation of region-specific attention. Thus, we introduce MedVP, a pioneering
framework that integrates medical entity extraction, visual prompt generation,
and dataset adaptation for visual-prompt-guided fine-tuning. We
outperform recent state-of-the-art large models across multiple medical VQA
datasets. Extensive experiments and human evaluation are conducted to analyze
the impact of different visual prompt forms and how they contribute to
performance improvement. The results demonstrate both the effectiveness and
clinical significance of our approach.