Tri-MCA fusion: cross-modal attention and dynamic gating for multimodal sentiment analysis.
Journal:
Scientific Reports
Published Date:
Jun 3, 2026
Abstract
Multimodal sentiment analysis aims to automatically infer human emotions by jointly analyzing information from multiple modalities, such as text, audio, and visual signals. Although recent deep learning approaches have significantly improved multimodal representation learning, effectively modeling interactions among heterogeneous modalities while handling modality imbalance and noisy signals remains challenging. Existing fusion strategies often rely on static or limited cross-modal interactions, which may fail to capture complex dependencies among modalities and consequently limit sentiment prediction performance. To address these challenges, this paper proposes a tri-modal cross-attention architecture with adaptive gating mechanisms for multimodal sentiment analysis. The proposed framework introduces a tri-modal cross-attention module to better capture interactions among the textual, acoustic, and visual modalities. The model is evaluated on three widely used benchmark datasets: CMU-MOSI, CMU-MOSEI, and SIMS. Experimental results show that the proposed method improves performance across multiple evaluation metrics. On CMU-MOSI, the model obtains a binary accuracy (Acc2) of 85.6% and an F1-score of 85.2, outperforming several existing multimodal sentiment analysis models. Similar improvements are observed on CMU-MOSEI and SIMS. Furthermore, ablation studies confirm the contribution of each component, highlighting the importance of combining cross-modal attention with adaptive gating mechanisms for robust multimodal sentiment prediction.
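The abstract does not give the layer-level details of the proposed model, so the following is only a minimal illustrative sketch of the general idea of combining cross-modal attention with a learned gate. It assumes that text features act as queries over audio and visual features, and that a sigmoid gate scales each attended stream before a residual sum. The function names (`cross_attention`, `gated_tri_modal_fusion`), the gate parameters (`w_gate`, `b_gate`), and all shapes are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cross_attention(query, context):
    """Scaled dot-product attention in which `query` attends to `context`.

    query: (Tq, d), context: (Tk, d) -> attended features of shape (Tq, d).
    """
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)   # (Tq, Tk)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ context


def gated_tri_modal_fusion(text, audio, visual, w_gate, b_gate):
    """Hypothetical fusion: text attends to audio and visual, then gates mix them.

    This is an assumed design for illustration, not the paper's architecture.
    """
    t2a = cross_attention(text, audio)    # text queries over audio,  (Tt, d)
    t2v = cross_attention(text, visual)   # text queries over visual, (Tt, d)
    # One gate per attended stream, computed from all three feature sets.
    gate_in = np.concatenate([text, t2a, t2v], axis=-1)   # (Tt, 3d)
    gates = sigmoid(gate_in @ w_gate + b_gate)            # (Tt, 2), in (0, 1)
    # Residual sum: a gate near 0 suppresses a noisy or uninformative modality.
    fused = text + gates[:, :1] * t2a + gates[:, 1:] * t2v
    return fused.mean(axis=0), gates      # mean-pool over text time steps


# Toy usage with random features (sequence lengths 5, 7, 6; feature size 8).
rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(5, d))
audio = rng.normal(size=(7, d))
visual = rng.normal(size=(6, d))
w_gate = rng.normal(size=(3 * d, 2)) * 0.1   # illustrative, untrained weights
b_gate = np.zeros(2)

pooled, gates = gated_tri_modal_fusion(text, audio, visual, w_gate, b_gate)
print(pooled.shape, gates.shape)
```

In this sketch the gate is input-dependent, so the contribution of the audio and visual streams can vary per time step, which is one plausible way to handle modality imbalance and noisy signals. A trained model would learn `w_gate` and `b_gate` and would typically use multi-head attention with learned projections.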