REVIEW ARTICLE: A Performance Benchmarking Review of Transformers for Speaker-Independent Speech Emotion Recognition.

Journal: International Journal of Neural Systems
Published Date:

Abstract

Speech Emotion Recognition (SER) is becoming a key element of speech-based human-computer interfaces, endowing them with some form of empathy towards the emotional state of the user. Transformers have become a central Deep Learning (DL) architecture in natural language processing and signal processing, and have recently been applied to audio signals for Automatic Speech Recognition (ASR) and SER. A central question addressed in this paper is how to achieve speaker-independent SER systems, i.e. systems whose performance does not depend on a specific training set, so that they can be deployed in real-world situations beyond the typical limitations of laboratory environments. This paper presents a comprehensive performance evaluation review of transformer architectures that have been proposed for the SER task, carrying out an independent validation at different levels on the most relevant publicly available SER datasets. The experimental design provides an accurate picture of the performance achieved by current state-of-the-art transformer models in speaker-independent SER. We found that most experimental instances reach accuracies below 40% when a model is trained on one dataset and tested on a different one. A speaker-independent evaluation that trains on a combination of up to five datasets and tests on a different one achieves up to 58.85% accuracy. In conclusion, SER results improved with the aggregation of datasets, indicating that model generalization can be enhanced by drawing training data from diverse datasets.
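
The cross-dataset protocol described in the abstract (train on one or more datasets, test on a dataset that was not used for training) can be illustrated with a minimal sketch. This is not the authors' code: the function name `cross_corpus_splits`, the `max_train` parameter, and the corpus names are hypothetical placeholders, and the pool of six corpora is only an assumption drawn from the phrase "up to five datasets and testing on a different one".

```python
from itertools import combinations


def cross_corpus_splits(corpora, max_train=None):
    """Yield (train_corpora, test_corpus) pairs.

    The test corpus is never part of the training combination, so every
    split evaluates on data from a corpus unseen during training.
    """
    corpora = list(corpora)
    for test in corpora:
        others = [c for c in corpora if c != test]
        limit = len(others) if max_train is None else min(max_train, len(others))
        # Enumerate training combinations of size 1 up to `limit`.
        for k in range(1, limit + 1):
            for train in combinations(others, k):
                yield train, test


# Hypothetical corpus names; the abstract does not list the datasets.
corpus_names = ["corpus_A", "corpus_B", "corpus_C",
                "corpus_D", "corpus_E", "corpus_F"]

# Single-source setting: train on one corpus, test on another.
pairwise = list(cross_corpus_splits(corpus_names, max_train=1))

# Aggregated setting: train on up to five corpora, test on the remaining one.
aggregated = list(cross_corpus_splits(corpus_names, max_train=5))
```

In this sketch each split only names the corpora involved; loading the audio, fine-tuning a transformer, and computing accuracy would be done per split by whatever training pipeline is in use.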

Authors

  • Francisco Portal
    Department of Artificial Intelligence, Universidad Politécnica de Madrid, Madrid, Spain.
  • Javier De Lope
    Department of Artificial Intelligence, Universidad Politécnica de Madrid (UPM), Madrid, Spain.
  • Manuel Graña
    Computational Intelligence Group, Faculty of Informatics, Basque Country University (UPV/EHU), Paseo Manuel de Lardizabal 1, 20018 San Sebastian, Spain; Department of Computer Science and Artificial Intelligence, Faculty of Informatics, Basque Country University (UPV/EHU), Paseo Manuel de Lardizabal 1, 20018 San Sebastian, Spain; ENGINE Centre, Wrocław University of Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland.
