Speech emotion recognition based on a stacked autoencoders optimized by PSO based grass fibrous root optimization.

Journal: Scientific Reports
Published Date:

Abstract

Effective speech emotion recognition (SER) is a significant challenge because human emotions are intricate and subjective. Accurately recognizing emotional states from speech signals has a broad range of practical applications, including healthcare, human-computer interaction, and social robotics. This study introduces an approach that combines deep learning with metaheuristic algorithms to improve the efficiency of SER systems. Specifically, a stacked autoencoder (SAE) serves as the primary model, and its performance is fine-tuned using a nature-inspired hybrid algorithm that combines Particle Swarm Optimization (PSO) with Grass Fibrous Root Optimization (GFRO). The proposed model extracts spectral and pitch features from speech signals, including spectral crest, spectral entropy, spectral flux, and harmonic ratio, to capture emotional cues. The model is evaluated on a standard emotion recognition dataset and compared with several state-of-the-art models, including Convolutional Neural Network (CNN), Support Vector Machine (SVM), Deep Learning (DL), CNN with Iterative Neighborhood Component Analysis (CNN/INCA), and VGG-16, and it achieves high accuracy in identifying various emotional states.
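To make the spectral features named in the abstract concrete, the following is a minimal sketch of how spectral crest, spectral entropy, and spectral flux are commonly computed from short-time magnitude spectra. It is not the authors' implementation: the function names, the frame length, the hop size, the Hann window, and the entropy normalization are all assumptions chosen for illustration, and the harmonic ratio and the SAE/PSO-GFRO stages are not covered.

```python
import numpy as np

def frame_spectra(signal, frame_len=512, hop=256):
    """Magnitude spectra of Hann-windowed frames.

    Returns an array of shape (n_frames, frame_len // 2 + 1).
    frame_len and hop are illustrative defaults, not values from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

def spectral_crest(mag, eps=1e-12):
    """Peak-to-mean ratio of each spectrum (high for tonal frames)."""
    return mag.max(axis=1) / (mag.mean(axis=1) + eps)

def spectral_entropy(mag, eps=1e-12):
    """Shannon entropy of each normalized spectrum, scaled to roughly [0, 1]."""
    p = mag / (mag.sum(axis=1, keepdims=True) + eps)
    return -(p * np.log2(p + eps)).sum(axis=1) / np.log2(mag.shape[1])

def spectral_flux(mag):
    """Euclidean distance between consecutive spectra (one value per frame pair)."""
    diff = np.diff(mag, axis=0)
    return np.sqrt((diff ** 2).sum(axis=1))
```

Under these definitions, a steady pure tone would be expected to show a high crest, low entropy, and near-zero flux, while white noise shows the opposite pattern. Per-frame values like these are typically pooled (for example by mean and standard deviation) into a fixed-length vector before being passed to a model such as an SAE.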

Authors

  • Chi Zeng
    Xinyang Vocational and Technical College, Xinyang, 464000, Henan, China.
  • Jialing Li
    Business School, Sichuan University, Chengdu, China.
  • Abbas Habibi
    University of Tehran, Tehran, Iran. abbashabibi208@gmail.com.