MS-EmoBoost: a novel strategy for enhancing self-supervised speech emotion representations.

Journal: Scientific Reports

Published Date: Jul 1, 2025

Abstract

Extracting richer emotional representations from raw speech is a key route to improving the accuracy of Speech Emotion Recognition (SER). In recent years, self-supervised learning (SSL) has increasingly been used to extract SER features, motivated by its exceptional performance in Automatic Speech Recognition (ASR). However, existing SSL methods are not sufficiently sensitive to emotional information, which makes them less effective for SER tasks. To overcome this issue, this study proposes MS-EmoBoost, a novel strategy for enhancing self-supervised speech emotion representations. Specifically, MS-EmoBoost uses deep emotional information from Mel-frequency cepstral coefficients (MFCCs) and spectrograms as guidance to enhance the emotional representation capability of self-supervised features. To evaluate the proposed approach, we conduct comprehensive experiments on three benchmark speech emotion datasets: IEMOCAP, EMODB, and EMOVO. SER performance is measured by weighted accuracy (WA) and unweighted accuracy (UA). The experimental results show that our method enhances the emotional representation capability of wav2vec 2.0 Base features, achieving competitive performance in SER tasks (IEMOCAP: WA 72.10%, UA 72.91%; EMODB: WA 92.45%, UA 92.62%; EMOVO: WA 86.88%, UA 87.51%), and that it is also effective for other self-supervised features.
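The abstract reports WA and UA without defining them. The sketch below assumes the definitions commonly used in SER work: WA is the overall fraction of correctly classified utterances, and UA is the mean of the per-class recalls, so every emotion counts equally regardless of how many utterances it has. The function names and the toy labels are illustrative only and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: each class contributes equally."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    totals = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)    # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy, imbalanced example (assumed labels, not data from the paper):
# 6 "neutral" utterances all correct, 2 "angry" utterances with 1 correct.
y_true = ["neutral"] * 6 + ["angry"] * 2
y_pred = ["neutral"] * 6 + ["neutral", "angry"]

wa = weighted_accuracy(y_true, y_pred)    # 7 correct out of 8
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 1.0 and 0.5
print(f"WA = {wa:.3f}, UA = {ua:.3f}")
```

On imbalanced data the two metrics diverge, as in the toy example, which is presumably why both are reported for each dataset.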

Authors

Hongchen Song

College of Computer and Information Engineering, Tianjin Normal University, Tianjin, 300387, China.
Long Zhang

Hefei Institute of Physical Science, Chinese Academy of Sciences, Hefei 230036, PR China. liuyong@aiofm.ac.cn, zhanglong@aiofm.ac.cn, wangchongwen1987@126.com.
Meixian Gao

College of Computer and Information Engineering, Tianjin Normal University, Tianjin, 300387, China.
Hengyuan Zhang

College of Engineering, South China Agricultural University, Guangzhou, China; State Key Laboratory of Livestock and Poultry Breeding, Guangzhou, China.
Thomas Hain

School of Computer Science, The University of Sheffield, Sheffield, UK.
Linlin Shan

College of Fine Arts and Design, Tianjin Normal University, Tianjin, 300387, China. shanlinlin@tjnu.edu.cn.

Keywords

Algorithms; Emotions; Humans; Speech; Speech Recognition Software; Supervised Machine Learning

External Resources

PubMed ID: 40595398
