TDNN architecture with efficient channel attention and improved residual blocks for accurate speaker recognition.

Journal: Scientific Reports

Published Date: Jul 2, 2025

Abstract

In recent years, with the advancement of deep learning, Convolutional Neural Networks (CNNs) have been widely applied in speaker recognition, making CNN-based speaker embedding learning the predominant method for speaker verification. Time Delay Neural Networks (TDNNs) have achieved notable progress in speaker embedding tasks. However, TDNNs often struggle to model multi-scale features accurately when processing complex audio data, which can reduce speaker recognition accuracy. To address this issue, we propose the Efficient Parallel Channel Network - Time Delay Neural Network (EPCNet-TDNN), which builds upon the ECAPA-TDNN architecture. The proposed model incorporates a novel Efficient Channel and Spatial Attention Mechanism (ECAM) in the ECA_block, which replaces the original SE_block. This modification enhances the model's ability to capture key information, improving overall performance. To further reduce feature dependency and enhance multi-scale information fusion, a Parallel Residual Structure (PRS) is introduced, which captures multi-scale features independently through parallel computation instead of sequential processing. The ECA_block adopts the output structure of ECAPA-TDNN, which we call a Tandem Structure (TS); it facilitates the integration of information from different scales and channels, resulting in more refined feature representations. After multi-scale feature extraction, the Selective State Space (SSS) module is introduced to improve the model's ability to capture temporal sequence features. Experimental results on the CN-Celeb1 dataset show that EPCNet-TDNN achieves relative improvements of about 14.1% (0.025), 9.4% (0.075), and 6.6% in EER, minDCF, and ACC, respectively, compared to ECAPA-TDNN. These results demonstrate the significant improvements achieved by the proposed approach over previous methods.
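To make the channel-attention idea concrete, the following is a minimal sketch of generic ECA-style channel attention: average each channel over time, mix neighbouring channels with a small 1D convolution, and rescale each channel by a sigmoid gate. It is not the authors' ECAM, which also includes a spatial component not described in the abstract. The function name eca_channel_attention, the kernel_size parameter, and the fixed uniform kernel (standing in for learned weights) are all assumptions made for illustration.

```python
import math


def eca_channel_attention(features, kernel_size=3):
    """Apply ECA-style channel gating to a C x T feature map.

    features: list of C channels, each a list of T frame values.
    Returns (scaled_features, gates). Hypothetical helper, not from the paper.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    num_channels = len(features)

    # Squeeze: global average pooling over the time axis, one value per channel.
    pooled = [sum(channel) / len(channel) for channel in features]

    # Assumption: a fixed uniform kernel stands in for the learned 1D conv weights.
    weights = [1.0 / kernel_size] * kernel_size
    half = kernel_size // 2

    gates = []
    for c in range(num_channels):
        # Local cross-channel interaction with zero padding at the edges.
        acc = 0.0
        for j, w in enumerate(weights):
            idx = c + j - half
            if 0 <= idx < num_channels:
                acc += w * pooled[idx]
        # Sigmoid turns the mixed descriptor into a gate in (0, 1).
        gates.append(1.0 / (1.0 + math.exp(-acc)))

    # Excite: rescale every frame of each channel by that channel's gate.
    scaled = [[g * v for v in channel] for g, channel in zip(gates, features)]
    return scaled, gates


if __name__ == "__main__":
    demo = [[1.0, 3.0], [0.0, 0.0], [-2.0, -2.0]]
    scaled, gates = eca_channel_attention(demo)
    print(gates)
```

Unlike a squeeze-and-excitation block, this form of gating uses no fully connected bottleneck; each channel's gate depends only on its kernel_size nearest channels, which keeps the parameter count small.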

Authors

Wenzao Li

School of Communication Engineering, Chengdu University of Information Technology, Chengdu, 610225, Sichuan, China.

Sai Yao

School of Communication Engineering, Chengdu University of Information Technology, Chengdu, 610225, Sichuan, China. yishen544@gmail.com.

Bing Wan

School of Software, Chengdu Polytechnic, Chengdu, 610225, China.

Linsong Xiao

School of Communication Engineering, Chengdu University of Information Technology, Chengdu, 610225, Sichuan, China.

Chengyu Hou

School of Communication Engineering, Chengdu University of Information Technology, Chengdu, 610225, Sichuan, China.

Yanchuan Zhong

Sichuan Provincial Climate Center, Chengdu, 610072, China.

Wengang Zhou

Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, 230026, Anhui Province, China. zhwg@ustc.edu.cn.

Keywords

Algorithms; Deep Learning; Humans; Neural Networks, Computer

External Resources

PubMed ID: 40603701
