MS-EmoBoost: a novel strategy for enhancing self-supervised speech emotion representations.

Journal: Scientific reports
Published Date:

Abstract

Extracting richer emotional representations from raw speech is one of the key approaches to improving the accuracy of Speech Emotion Recognition (SER). In recent years, there has been a trend toward using self-supervised learning (SSL) to extract SER features, owing to the exceptional performance of SSL in Automatic Speech Recognition (ASR). However, existing SSL methods are not sufficiently sensitive to emotional information, which makes them less effective for SER tasks. To overcome this issue, this study proposes MS-EmoBoost, a novel strategy for enhancing self-supervised speech emotion representations. Specifically, MS-EmoBoost uses the deep emotional information from Mel-frequency cepstral coefficients (MFCCs) and spectrograms as guidance to enhance the emotional representation capabilities of self-supervised features. To determine the effectiveness of the proposed approach, we conduct comprehensive experiments on three benchmark speech emotion datasets: IEMOCAP, EMODB, and EMOVO. SER performance is measured by weighted accuracy (WA) and unweighted accuracy (UA). The experimental results show that our method successfully enhances the emotional representation capability of wav2vec 2.0 Base features, achieving competitive performance in SER tasks (IEMOCAP: WA 72.10%, UA 72.91%; EMODB: WA 92.45%, UA 92.62%; EMOVO: WA 86.88%, UA 87.51%), and that it is also effective for other self-supervised features.
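
The abstract does not spell out how WA and UA are computed. The sketch below assumes the definitions commonly used in SER work: WA is the overall fraction of correctly classified utterances, and UA is the mean of the per-class recalls, so that every emotion class counts equally regardless of its size. The function names and the toy labels are illustrative only and are not taken from the paper.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: each emotion class contributes equally."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    totals = Counter(y_true)  # utterances per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, made-up labels with an imbalanced class distribution (assumption).
y_true = ["ang", "ang", "ang", "ang", "hap", "hap", "sad", "neu"]
y_pred = ["ang", "ang", "ang", "hap", "hap", "sad", "sad", "neu"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

Under these assumed definitions the two metrics diverge when classes are imbalanced, which is presumably why both are reported for each dataset.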

Authors

  • Hongchen Song
    College of Computer and Information Engineering, Tianjin Normal University, Tianjin, 300387, China.
  • Long Zhang
    Hefei Institute of Physical Science, Chinese Academy of Sciences, Hefei 230036, PR China. liuyong@aiofm.ac.cn, zhanglong@aiofm.ac.cn, wangchongwen1987@126.com.
  • Meixian Gao
    College of Computer and Information Engineering, Tianjin Normal University, Tianjin, 300387, China.
  • Hengyuan Zhang
    College of Engineering, South China Agricultural University, Guangzhou, China; State Key Laboratory of Livestock and Poultry Breeding, Guangzhou, China.
  • Thomas Hain
    School of Computer Science, The University of Sheffield, Sheffield, UK.
  • Linlin Shan
    College of Fine Arts and Design, Tianjin Normal University, Tianjin, 300387, China. shanlinlin@tjnu.edu.cn.