TDNN architecture with efficient channel attention and improved residual blocks for accurate speaker recognition.

Journal: Scientific Reports
Published Date:

Abstract

In recent years, with the advancement of deep learning, Convolutional Neural Networks (CNNs) have been widely applied in speaker recognition, making CNN-based speaker embedding learning the predominant method for speaker verification. Time Delay Neural Networks (TDNNs) have achieved notable progress in speaker embedding tasks. However, TDNNs often struggle to model multi-scale features accurately when processing complex audio data, which can reduce speaker recognition accuracy. To address this issue, we propose the Efficient Parallel Channel Network - Time Delay Neural Network (EPCNet-TDNN), which builds upon the ECAPA-TDNN architecture. The proposed model incorporates a novel Efficient Channel and Spatial Attention Mechanism (ECAM) in the ECA_block, which replaces the original SE_block. This modification enhances the model's ability to capture key information and improves overall performance. To further reduce feature dependency and enhance multi-scale information fusion, a Parallel Residual Structure (PRS) is introduced, which captures multi-scale features independently through parallel computation instead of sequential processing. The ECA_block adopts the output structure of ECAPA-TDNN, referred to here as the Tandem Structure (TS), which facilitates the integration of information from different scales and channels and results in more refined feature representations. After multi-scale feature extraction, the Selective State Space (SSS) module is introduced to improve the model's ability to capture temporal sequence features. Experimental results on the CN-Celeb1 dataset show that EPCNet-TDNN achieves relative improvements of about 14.1% (0.025), 9.4% (0.075), and 6.6% in EER, minDCF, and ACC, respectively, compared to ECAPA-TDNN. These results demonstrate the significant improvements achieved by the proposed approach over previous methods.
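
To make the idea of channel attention concrete, here is a minimal sketch of a generic ECA-style channel gate in plain Python. It is not the authors' ECAM or ECA_block, whose details the abstract does not give. The function name `eca_channel_attention`, the `[channels][frames]` list layout, and the uniform kernel weights (standing in for learned ones) are all assumptions made for illustration.

```python
import math


def eca_channel_attention(features, kernel_size=3):
    """Generic ECA-style channel gating for a [channels][frames] feature map.

    Assumption: uniform 1D kernel weights stand in for learned parameters.
    Returns the rescaled feature map and the per-channel gates.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    num_channels = len(features)

    # Squeeze: average each channel over the time axis.
    pooled = [sum(row) / len(row) for row in features]

    # Local cross-channel interaction: a 1D convolution along the channel
    # axis, zero-padded so that every channel receives exactly one gate.
    half = kernel_size // 2
    weights = [1.0 / kernel_size] * kernel_size  # hypothetical, not learned
    gates = []
    for c in range(num_channels):
        acc = 0.0
        for k in range(kernel_size):
            idx = c + k - half
            if 0 <= idx < num_channels:
                acc += weights[k] * pooled[idx]
        gates.append(1.0 / (1.0 + math.exp(-acc)))  # sigmoid, in (0, 1)

    # Excite: rescale every frame of a channel by that channel's gate.
    scaled = [[g * x for x in row] for g, row in zip(gates, features)]
    return scaled, gates


# Toy input: 3 channels, 2 frames each.
features = [[1.0, 3.0], [0.0, 0.0], [-2.0, -2.0]]
scaled, gates = eca_channel_attention(features)
```

Unlike a squeeze-and-excitation gate, which passes the pooled vector through fully connected layers, this style of gate mixes only neighbouring channels through a small 1D kernel, so its parameter count does not grow with the number of channels.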

Authors

  • Wenzao Li
    School of Communication Engineering, Chengdu University of Information Technology, Chengdu, 610225, Sichuan, China.
  • Sai Yao
    School of Communication Engineering, Chengdu University of Information Technology, Chengdu, 610225, Sichuan, China. yishen544@gmail.com.
  • Bing Wan
    School of Software, Chengdu Polytechnic, Chengdu, 610225, China.
  • Linsong Xiao
    School of Communication Engineering, Chengdu University of Information Technology, Chengdu, 610225, Sichuan, China.
  • Chengyu Hou
    School of Communication Engineering, Chengdu University of Information Technology, Chengdu, 610225, Sichuan, China.
  • Yanchuan Zhong
    Sichuan Provincial Climate Center, Chengdu, 610072, China.
  • Wengang Zhou
    Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, 230026, Anhui Province, China. zhwg@ustc.edu.cn.