Combining a parallel 2D CNN with a self-attention Dilated Residual Network for CTC-based discrete speech emotion recognition.

Journal: Neural Networks: The Official Journal of the International Neural Network Society

Published Date: Mar 23, 2021

Abstract

Efficiently modelling long temporal contexts remains a challenge in automatic speech emotion recognition (SER). Recurrent neural network (RNN) architectures are typically employed by default to capture long-term temporal dependencies between features. In this work, we present an efficient deep neural network architecture that incorporates the Connectionist Temporal Classification (CTC) loss for discrete SER. We also show that SER performance can be improved further by exploiting the properties of convolutional neural networks (CNNs) when modelling contextual information. Our proposed model uses parallel convolutional layers (PCN) integrated with a Squeeze-and-Excitation Network (SEnet), a system herein denoted as PCNSE, to extract relationships across timesteps and frequencies from 3D spectrograms, formed from the log-Mel spectrogram with its deltas and delta-deltas. A self-attention Dilated Residual Network (SADRN) with CTC then serves as the classification block. To the best of the authors' knowledge, this is the first time that such a hybrid architecture has been employed for discrete SER. We demonstrate the effectiveness of the proposed approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus and the FAU-Aibo Emotion Corpus (FAU-AEC). Experimental results show that the proposed method is well suited to discrete SER, achieving a weighted accuracy (WA) of 73.1% and an unweighted accuracy (UA) of 66.3% on IEMOCAP, and a UA of 41.1% on FAU-AEC.
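
The abstract reports both weighted accuracy (WA) and unweighted accuracy (UA). As a minimal sketch, assuming the definitions commonly used in SER work (WA as overall accuracy over all utterances, UA as the mean of per-class recalls), the two metrics could be computed as below. The function names and the toy labels are hypothetical and are not taken from the paper or its datasets.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    totals = Counter(y_true)  # utterances per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy labels invented for illustration only (not IEMOCAP or FAU-AEC data).
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["neutral", "neutral"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
```

On imbalanced data such as this toy set, WA is dominated by the majority class while UA penalises poor recall on the minority classes, which is one reason both figures are usually reported together.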

Authors

Ziping Zhao
College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China.

Qifei Li
College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China.

Zixing Zhang
Chair of Complex and Intelligent Systems, University of Passau, Innstr. 43, Passau 94032, Germany.

Nicholas Cummins
Department of Biostatistics & Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, United Kingdom.

Haishuai Wang

Jianhua Tao
School of Artificial Intelligence, University of Chinese Academy of Sciences, China; National Laboratory of Pattern Recognition, Chinese Academy of Sciences, China; CAS Center for Excellence in Brain Science and Intelligence Technology, China. Electronic address: jhtao@nlpr.ia.ac.cn.

Björn W Schuller
College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China; GLAM - Group on Language, Audio, & Music, Imperial College London, UK; Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany.

Keywords

Child; Emotions; Female; Humans; Male; Neural Networks, Computer; Speech

External Resources

PubMed ID: 33866302
