Learning to Fuse: Modality-Aware Adaptive Scheduling for Robust Multimodal Foundation Models
Journal: arXiv
Published Date: Jun 15, 2025
Abstract
Multimodal foundation models have achieved impressive progress across a wide
range of vision-language tasks. However, existing approaches often adopt fixed
or task-specific fusion strategies, neglecting the intrinsic variability of
modality reliability and sample complexity. In this paper, we propose
Modality-Aware Adaptive Fusion Scheduling (MA-AFS), a general framework that
learns to dynamically modulate the contribution of each modality on a
per-instance basis. MA-AFS introduces a lightweight neural scheduler that
predicts modality fusion weights by integrating visual and textual entropy
signals along with cross-modal agreement cues. This enables the model to
adaptively emphasize more reliable modalities, especially under noisy, missing,
or misaligned inputs. We formulate the fusion process as a differentiable
scheduling mechanism, analyze its theoretical consistency and regularization
effect, and demonstrate that it improves robustness without significantly
increasing model capacity. Extensive experiments on image-text retrieval,
captioning, and visual question answering show that MA-AFS achieves consistent
performance gains over strong baselines such as CLIP, ALBEF, and BLIP.
Moreover, MA-AFS exhibits improved robustness under modality corruption and
enhanced generalization under domain shifts. Our work highlights the importance
of adaptive fusion and opens a promising direction toward reliable and
uncertainty-aware multimodal learning.
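
As a rough illustration of the scheduling idea described in the abstract, the following sketch computes per-instance fusion weights from the entropy of each modality's predictive distribution and a cosine-similarity agreement cue. It is not the authors' implementation. The function names, the hand-set coefficients `alpha` and `beta` (standing in for the learned scheduler's parameters), and the particular way disagreement sharpens the softmax are all assumptions made for this example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cosine(u, v):
    """Cosine similarity, used here as a stand-in cross-modal agreement cue."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def schedule_weights(p_img, p_txt, z_img, z_txt, alpha=1.0, beta=1.0):
    """Hypothetical scheduler returning [w_img, w_txt], which sum to 1.

    Assumption: lower predictive entropy means a more reliable modality,
    and weaker agreement between the two embeddings makes the scheduler
    commit more sharply to the lower-entropy modality.
    """
    logits = [-alpha * entropy(p_img), -alpha * entropy(p_txt)]
    sharpness = 1.0 + beta * (1.0 - cosine(z_img, z_txt))
    return softmax([sharpness * s for s in logits])


def fuse(z_img, z_txt, weights):
    """Weighted sum of the two modality embeddings."""
    w_img, w_txt = weights
    return [w_img * a + w_txt * b for a, b in zip(z_img, z_txt)]


# Example: a noisy image (uniform prediction) paired with a confident caption.
p_img = [0.25, 0.25, 0.25, 0.25]
p_txt = [0.97, 0.01, 0.01, 0.01]
z_img = [1.0, 0.0]
z_txt = [0.0, 1.0]

weights = schedule_weights(p_img, p_txt, z_img, z_txt)
fused = fuse(z_img, z_txt, weights)
print(weights, fused)
```

In this toy setting the text branch receives the larger weight because its prediction has lower entropy. In a learned scheduler of the kind the abstract describes, the mapping from these signals to weights would be trained end to end rather than fixed by hand.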