Dynamic Modality Scheduling for Multimodal Large Models via Confidence, Uncertainty, and Semantic Consistency
Journal: arXiv
Published Date: Jun 15, 2025
Abstract
Multimodal Large Models (MLLMs) have achieved remarkable progress in
vision-language understanding and generation tasks. However, existing MLLMs
typically rely on static modality fusion strategies, which treat all modalities
equally regardless of their instance-level reliability or semantic
contribution. This often leads to suboptimal performance, especially in
scenarios with noisy, missing, or misaligned modalities.
In this paper, we propose Dynamic Modality Scheduling (DMS), a novel
framework that adaptively adjusts the contribution of each modality at a
per-sample level. DMS evaluates each modality based on three key factors: (1)
\textit{confidence}, estimated from predictive entropy; (2)
\textit{uncertainty}, obtained via Monte Carlo dropout; and (3)
\textit{semantic consistency}, computed through inter-modal similarity. These
signals are combined through a learnable or rule-based scheduler to generate
soft modality weights used in downstream fusion. To ensure stable training, we
further introduce a \textit{Modality Weight Consistency Loss}, which
regularizes the fused representation to stay close to each unimodal embedding
in proportion to its assigned weight. Our method is model-agnostic and can
be integrated into existing MLLMs such as BLIP-2 and LLaVA. Experimental
results on VQA, image-text retrieval, and captioning tasks show that DMS
significantly improves both clean-input performance and robustness, especially
under modality corruption or dropout conditions. This work provides a general and
effective mechanism to enable instance-aware and robustness-enhanced multimodal
modeling.
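
The abstract does not give explicit formulas, so the following is only a minimal NumPy sketch of how the three signals and the consistency loss could fit together. Every function name, the linear scoring rule with coefficients alpha, beta, and gamma, the use of variance across stochastic passes as the Monte Carlo dropout uncertainty, and the squared-distance form of the loss are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def confidence_from_entropy(probs):
    """Assumed form: 1 - normalized predictive entropy, so the value lies in [0, 1]."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=-1)
    return 1.0 - entropy / np.log(p.shape[-1])

def mc_dropout_uncertainty(mc_probs):
    """mc_probs has shape (T, C): class probabilities from T stochastic
    forward passes with dropout active. Assumed form: mean per-class variance."""
    return float(np.asarray(mc_probs, dtype=float).var(axis=0).mean())

def semantic_consistency(embeddings):
    """Per-modality mean cosine similarity to the other modalities (needs >= 2)."""
    E = np.stack([np.asarray(e, dtype=float) for e in embeddings])
    E = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)
    sim = E @ E.T
    return (sim.sum(axis=1) - np.diag(sim)) / (len(embeddings) - 1)

def schedule_weights(conf, unc, cons, alpha=1.0, beta=1.0, gamma=1.0):
    """Hypothetical rule-based scheduler: reward confidence and consistency,
    penalize uncertainty, then normalize to soft weights with a softmax."""
    scores = alpha * np.asarray(conf) - beta * np.asarray(unc) + gamma * np.asarray(cons)
    return softmax(scores)

def weight_consistency_loss(fused, embeddings, weights):
    """Assumed form: sum over modalities of w_m * ||fused - e_m||^2."""
    fused = np.asarray(fused, dtype=float)
    return float(sum(w * np.sum((fused - np.asarray(e, dtype=float)) ** 2)
                     for w, e in zip(weights, embeddings)))

# Toy example with two modalities (all values below are made up).
rng = np.random.default_rng(0)
image_emb, text_emb = rng.normal(size=8), rng.normal(size=8)
image_probs = np.array([0.90, 0.05, 0.05])   # peaked prediction: high confidence
text_probs = np.array([0.40, 0.30, 0.30])    # flat prediction: low confidence
image_mc = softmax(rng.normal(size=(10, 3)) * 0.1 + np.log(image_probs))
text_mc = softmax(rng.normal(size=(10, 3)) * 1.0 + np.log(text_probs))

conf = [confidence_from_entropy(image_probs), confidence_from_entropy(text_probs)]
unc = [mc_dropout_uncertainty(image_mc), mc_dropout_uncertainty(text_mc)]
cons = semantic_consistency([image_emb, text_emb])

weights = schedule_weights(conf, unc, cons)
fused = weights[0] * image_emb + weights[1] * text_emb   # weighted-sum fusion
loss = weight_consistency_loss(fused, [image_emb, text_emb], weights)
print("weights:", weights, "consistency loss:", loss)
```

With only two modalities the pairwise similarity is symmetric, so the consistency term shifts both scores equally and cancels in the softmax. In this sketch it only affects the weights when three or more modalities are present, or when it is computed against a separate reference such as a fused or query embedding.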