MemoCMT: multimodal emotion recognition using cross-modal transformer-based feature fusion.
Journal:
Scientific Reports
PMID:
39953105
Abstract
Speech emotion recognition has seen a surge in transformer models, which excel at understanding the overall message by analyzing long-term patterns in speech. However, these models come at a computational cost. In contrast, convolutional neural networks are faster but struggle to capture these long-range relationships. Our proposed system, MemoCMT, tackles this challenge using a novel "cross-modal transformer" (CMT). This CMT can effectively analyze local and global speech features and their corresponding text. To boost efficiency, MemoCMT leverages recent advancements in pre-trained models: HuBERT extracts meaningful features from the audio, while BERT analyzes the text. The core innovation lies in how the CMT component utilizes and integrates these audio and text features. After this integration, different fusion techniques are applied before final emotion classification. Experiments show that MemoCMT achieves strong performance, with the CMT using min aggregation achieving the highest unweighted accuracies (UW-Acc) of 81.33% and 91.93%, and weighted accuracies (W-Acc) of 81.85% and 91.84%, on the benchmark IEMOCAP and ESD corpora respectively. These results demonstrate the system's generalization capacity and robustness for real-world industrial applications. Moreover, the implementation of MemoCMT is publicly available at https://github.com/tpnam0901/MemoCMT/ for reproducibility purposes.
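To make the described pipeline concrete, below is a minimal sketch of the architecture as the abstract outlines it: HuBERT for audio features, BERT for text features, a cross-modal attention stage, and min-aggregation fusion before classification. This is an illustrative reconstruction, not the authors' code; the module names, checkpoint choices, pooling, and attention layout here are assumptions, and the actual implementation is in the linked repository.

```python
# Illustrative sketch of a MemoCMT-style pipeline (PyTorch + Hugging Face Transformers).
# All structural details beyond "HuBERT + BERT + cross-modal attention + min fusion"
# are assumptions made for clarity.
import torch
import torch.nn as nn
from transformers import HubertModel, BertModel


class CrossModalTransformerSketch(nn.Module):
    def __init__(self, hidden_dim=768, num_heads=8, num_classes=4):
        super().__init__()
        # Pre-trained feature extractors named in the abstract.
        self.hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Cross-modal attention in both directions (assumed layout).
        self.text_to_audio = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, waveform, input_ids, attention_mask):
        # Extract frame-level audio features and token-level text features.
        audio_feat = self.hubert(waveform).last_hidden_state                     # (B, Ta, 768)
        text_feat = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state  # (B, Tt, 768)
        # Text queries attend over audio, and audio queries attend over text.
        t2a, _ = self.text_to_audio(text_feat, audio_feat, audio_feat)
        a2t, _ = self.audio_to_text(audio_feat, text_feat, text_feat)
        # Mean-pool each attended sequence, then fuse with an element-wise minimum
        # ("min aggregation" is one of the fusion techniques the abstract mentions).
        fused = torch.minimum(t2a.mean(dim=1), a2t.mean(dim=1))                  # (B, 768)
        return self.classifier(fused)                                            # emotion logits
```

Under this reading, min aggregation acts as a conservative fusion rule: the fused representation only keeps activation strength that both the audio-attended and text-attended views agree on, which is one plausible way such a fusion could favor cues supported by both modalities.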