MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis
Journal: arXiv
Published Date: May 27, 2025
Abstract
Recent vision-language foundation models deliver state-of-the-art results on
natural image classification but falter on medical images due to pronounced
domain shifts. At the same time, training a medical foundation model requires
substantial resources, including extensive annotated data and high
computational capacity. To bridge this gap with minimal overhead, we introduce
MedBridge, a lightweight multimodal adaptation framework that re-purposes
pretrained VLMs for accurate medical image diagnosis. MedBridge comprises three
key components. First, a Focal Sampling module extracts high-resolution
local regions to capture subtle pathological features and compensate for the
limited input resolution of general-purpose VLMs. Second, a Query Encoder
(QEncoder) injects a small set of learnable queries that attend to the frozen
feature maps of the VLM, aligning them with medical semantics without retraining
the entire backbone. Third, a Mixture of Experts mechanism, driven by learnable
queries, harnesses the complementary strengths of diverse VLMs to maximize
diagnostic performance. We evaluate MedBridge on five medical imaging
benchmarks across three key adaptation tasks, demonstrating its superior
performance in both cross-domain and in-domain adaptation settings, even under
varying levels of training data availability. Notably, MedBridge achieves a
6-15% improvement in AUC over state-of-the-art VLM adaptation methods in
multi-label thoracic disease diagnosis, underscoring its effectiveness in
leveraging foundation models for accurate and data-efficient medical diagnosis.
Our code is available at https://github.com/ai-med/MedBridge.
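
The query-based adaptation and expert mixing described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation and is not taken from the MedBridge repository. It assumes single-head attention without learned projections, random stand-ins for both the trained queries and the frozen VLM feature maps, and a plain logit vector as the gate, whereas the abstract states that the gate is driven by the learnable queries. The names `query_cross_attention` and `mix_experts`, and all shapes, are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_cross_attention(queries, features):
    """Let a small set of queries (Q, D) attend to frozen features (N, D).

    Returns the attended query outputs (Q, D) and the attention weights (Q, N).
    The features are only read, never updated, which mirrors a frozen backbone.
    """
    dim = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(dim)  # (Q, N) scaled dot products
    weights = softmax(scores, axis=-1)            # each query's weights sum to 1
    return weights @ features, weights


def mix_experts(expert_embeddings, gate_logits):
    """Fuse per-expert embeddings (E, D) with a softmax gate over E experts."""
    gate = softmax(gate_logits)
    return gate @ expert_embeddings, gate


rng = np.random.default_rng(0)
num_queries, dim = 4, 16

# Assumption: in a real system these queries would be trained; here they are random.
queries = rng.normal(size=(num_queries, dim))

# Assumption: two hypothetical frozen backbones producing 49 and 196 tokens.
frozen_features = [rng.normal(size=(49, dim)), rng.normal(size=(196, dim))]

expert_embeddings = []
all_weights = []
for feats in frozen_features:
    attended, weights = query_cross_attention(queries, feats)
    all_weights.append(weights)
    # Pool the query outputs into one embedding per backbone ("expert").
    expert_embeddings.append(attended.mean(axis=0))
expert_embeddings = np.stack(expert_embeddings)  # (2, dim)

# Assumption: equal gate logits, so both experts contribute equally.
gate_logits = np.zeros(len(frozen_features))
fused, gate = mix_experts(expert_embeddings, gate_logits)
```

In this sketch only `queries` and `gate_logits` would be trainable, which is one way to read the abstract's claim of adapting without retraining the entire backbone. A classification head on `fused` is omitted.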