MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation
Journal: arXiv
Published Date: Apr 29, 2025
Abstract
Medical image reporting (MIR) aims to generate structured clinical
descriptions from radiological images. Existing methods struggle with
fine-grained feature extraction, multimodal alignment, and generalization
across diverse imaging types, often relying on vanilla transformers and
focusing primarily on chest X-rays. We propose MicarVLMoE, a vision-language
mixture-of-experts model with gated cross-aligned fusion, designed to address
these limitations. Our architecture includes: (i) a multiscale vision encoder
(MSVE) for capturing anatomical details at varying resolutions, (ii) a
multihead dual-branch latent attention (MDLA) module for vision-language
alignment through latent bottleneck representations, and (iii) a modulated
mixture-of-experts (MoE) decoder for adaptive expert specialization. We extend
MIR to CT scans, retinal imaging, MRI scans, and gross pathology images,
reporting state-of-the-art results on the COVCTR, MMR, PGROSS, and ROCO datasets.
Extensive experiments and ablations confirm improved clinical accuracy,
cross-modal alignment, and model interpretability. Code is available at
https://github.com/AI-14/micar-vl-moe.
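
To make the idea of gated expert routing in component (iii) more concrete, the following is a minimal sketch of a generic top-k softmax-gated mixture-of-experts layer in plain Python. It is not the MicarVLMoE decoder: the abstract does not specify how the "modulated" MoE is gated, so the linear experts, the softmax gate, the top-k selection, and all names (`moe_forward`, `gate_weights`, `expert_weights`, `top_k`) are illustrative assumptions. The authors' actual implementation is in the repository linked above.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def moe_forward(token, gate_weights, expert_weights, top_k=2):
    """Route one token vector through its top_k highest-scoring experts.

    Assumption: each expert is a single linear map and the gate is a
    linear layer followed by softmax. Real MoE decoders typically use
    feed-forward experts and additional load-balancing terms.
    """
    gate_probs = softmax(matvec(gate_weights, token))
    # Indices of the top_k experts by gate probability.
    chosen = sorted(range(len(gate_probs)),
                    key=lambda i: gate_probs[i], reverse=True)[:top_k]
    # Renormalize the selected gate probabilities so they sum to 1.
    norm = sum(gate_probs[i] for i in chosen)
    output = [0.0] * len(expert_weights[0])
    for i in chosen:
        expert_out = matvec(expert_weights[i], token)
        weight = gate_probs[i] / norm
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, chosen, gate_probs


def random_matrix(rows, cols, rng):
    """Hypothetical helper: random weights for the demo only."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, num_experts = 4, 3
gate_weights = random_matrix(num_experts, dim, rng)
expert_weights = [random_matrix(dim, dim, rng) for _ in range(num_experts)]
token = [0.5, -1.0, 0.25, 2.0]

output, chosen, gate_probs = moe_forward(token, gate_weights, expert_weights, top_k=2)
print("experts used:", chosen)
print("output:", output)
```

Only the selected experts are evaluated for each token, which is what lets different experts specialize on different inputs while keeping per-token compute roughly constant as the number of experts grows.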