A multi-dilated convolution network for speech emotion recognition.

Journal: Scientific reports
PMID:

Abstract

Speech emotion recognition (SER) is an important application in Affective Computing and Artificial Intelligence. Recently, there has been significant interest in Deep Neural Networks that operate on speech spectrograms. Because the two-dimensional spectrogram representation captures more speech characteristics, convolutional neural networks (CNNs) and other advanced image recognition models have been leveraged to learn deep patterns in spectrograms and perform SER effectively. Accordingly, in this study, we propose a novel SER model based on learning from utterance-level spectrograms. First, we use the Spatial Pyramid Pooling (SPP) strategy to remove the fixed input-size constraint associated with CNN-based image recognition. The SPP layer then extracts both a global-level prominent feature vector and multi-local-level feature vectors, which are weighted by an attention model. Finally, we apply the ArcFace layer, typically used for face recognition, to the SER task, thereby improving SER performance. Our model achieved an unweighted accuracy of 67.9% on the IEMOCAP dataset and 77.6% on the EMODB dataset.
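
The two less familiar components named above, SPP and ArcFace, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not state the pyramid levels, the pooling operator, or the ArcFace hyperparameters, so the levels (1, 2, 4), max pooling, scale s = 30.0, and margin m = 0.5 are assumptions chosen for illustration, and the function names spp_max_pool and arcface_logits are hypothetical. The attention step is omitted.

```python
import math


def spp_max_pool(feature_map, levels=(1, 2, 4)):
    """Pool a 2-D feature map (list of rows) into a fixed-length vector.

    Each level n splits the map into an n x n grid and keeps the maximum
    of every cell, so the output length is sum(n * n for n in levels)
    regardless of the input size. Level 1 is the global-level feature;
    the finer levels give local features. The levels are an assumption.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Cell bounds: floor for the start, ceiling for the end,
            # which guarantees every cell holds at least one element.
            r0 = (i * rows) // n
            r1 = -(-((i + 1) * rows) // n)
            for j in range(n):
                c0 = (j * cols) // n
                c1 = -(-((j + 1) * cols) // n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


def arcface_logits(embedding, class_weights, target, s=30.0, m=0.5):
    """Return ArcFace-style logits: s * cos(theta + m) for the target
    class and s * cos(theta) for all other classes.

    s and m are illustrative values, not taken from the paper. This
    sketch does not handle the theta + m > pi case that practical
    implementations usually treat separately.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    e = unit(embedding)
    logits = []
    for k, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(e, unit(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos domain
        if k == target:
            cos_theta = math.cos(math.acos(cos_theta) + m)
        logits.append(s * cos_theta)
    return logits


# Two "spectrogram feature maps" of different sizes give equal-length vectors.
short_map = [[float(r * 5 + c) for c in range(5)] for r in range(3)]
long_map = [[float(r * 40 + c) for c in range(40)] for r in range(9)]
short_vec = spp_max_pool(short_map)
long_vec = spp_max_pool(long_map)

# A two-class toy example: the embedding is aligned with class 0.
toy_logits = arcface_logits([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], target=0)
```

In this sketch both inputs yield 21 values (1 + 4 + 16), which is what lets a fixed-size classifier follow a variable-length utterance. The margin lowers the target-class logit relative to plain cosine similarity, which is the mechanism ArcFace uses to push classes further apart during training.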

Authors

  • Samaneh Madanian
    Computer Science & Software Engineering, Auckland University of Technology, Auckland 1010, New Zealand.
  • Olayinka Adeleye
    Department of Data Science and Artificial Intelligence, Auckland University of Technology, Auckland, New Zealand.
  • John Michael Templeton
    University of South Florida - Department of Computer Science and Engineering, 4202 E Fowler Ave, Tampa, FL, 33620, USA. Electronic address: jtemplet@usf.edu.
  • Talen Chen
    Department of Data Science and Artificial Intelligence, Auckland University of Technology, Auckland, New Zealand.
  • Christian Poellabauer
    Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, United States.
  • Enshi Zhang
    Florida International University - Knight Foundation School of Computing and Information Sciences, 11200 SW 8th St, Miami, FL, 33199, USA. Electronic address: ezhan004@fiu.edu.
  • Sandra L Schneider
    Department of Communication Sciences and Disorders, Saint Mary's College, Notre Dame, IN, USA.