Multi-modal sentiment recognition with residual gating network and emotion intensity attention.
Journal:
Neural Networks: The Official Journal of the International Neural Network Society
Published Date:
Apr 25, 2025
Abstract
Multimodal emotion recognition aims to predict emotions from text, visual, and acoustic modalities, and prior work in this field has produced a number of results. Previous approaches fall short in two respects: they make limited use of the complementary information among modalities, and they struggle to avoid long-term dependency while selecting the most important joint modal features. In this paper, we propose a new multimodal emotion recognition framework, MSRG, which consists of five modules: feature extraction (FE), emotional intensity attention (EIA), time-step level fusion (TLF), utterance level fusion (ULF), and a sentiment inference module (SIM). EIA comprises adaptive multimodal linear pooling (AMLP) and joint cross-attention fusion (JCAF). AMLP adopts an adaptive multimodal fusion strategy: it dynamically calculates adaptive coefficients for the three modalities and then performs a pooling operation to obtain joint modal features. JCAF calculates the attention weights and attention features of each modality based on the cross-correlation between individual and joint features. TLF performs feature alignment and fusion at the time-step level, then uses a residual gating network (RGN) to process the time-step level fused sequences. The resulting time-step level fused features are passed through two fully connected layers and an activation layer to obtain the time-step level emotion intensity. ULF concatenates the utterance level representations of the three modalities and feeds the resulting utterance level fused features into a fully connected layer to obtain the utterance level emotion intensity. Finally, both the time-step level and the utterance level emotion intensities are input into SIM to obtain the final emotion prediction. Experiments demonstrate that MSRG achieves better prediction performance on the CMU-MOSI and CMU-MOSEI datasets.
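The abstract does not give equations for AMLP or RGN, so the following is only a minimal NumPy sketch of the two general ideas it names: weighting three modalities by dynamically computed coefficients before pooling, and passing a fused sequence through a gated residual connection. Every specific here is an assumption made for illustration and not the authors' formulation: the softmax scoring with a hypothetical vector `w`, the sigmoid/tanh form of the gate, the function names, and the requirement that all three modalities are already aligned to the same shape `(T, d)`.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def adaptive_pooling(text, visual, acoustic, w):
    """Assumed form of adaptive pooling over three aligned modalities.

    Each modality has shape (T, d). `w` is a hypothetical scoring
    vector of shape (d,) that would be learned in a real model.
    Returns the joint features (T, d) and the three coefficients.
    """
    stacked = np.stack([text, visual, acoustic], axis=0)  # (3, T, d)
    scores = stacked.mean(axis=1) @ w                     # one score per modality
    coeffs = softmax(scores)                              # non-negative, sums to 1
    joint = np.tensordot(coeffs, stacked, axes=1)         # weighted sum -> (T, d)
    return joint, coeffs


def residual_gate(x, W_g, b_g, W_h, b_h):
    """Assumed gated residual step: y = x + sigmoid(x W_g + b_g) * tanh(x W_h + b_h)."""
    gate = 1.0 / (1.0 + np.exp(-(x @ W_g + b_g)))  # values in (0, 1)
    candidate = np.tanh(x @ W_h + b_h)
    return x + gate * candidate                    # skip path keeps x intact


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 5, 4
    text, visual, acoustic = (rng.normal(size=(T, d)) for _ in range(3))
    joint, coeffs = adaptive_pooling(text, visual, acoustic, rng.normal(size=d))
    fused = residual_gate(
        joint,
        rng.normal(size=(d, d)), np.zeros(d),
        rng.normal(size=(d, d)), np.zeros(d),
    )
    print(coeffs, fused.shape)
```

The residual path is what makes this kind of gate attractive for long fused sequences: when the gated term is zero the input passes through unchanged, so the layer can only add information on top of the time-step level features.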