Enhancing target speaker extraction with Hierarchical Speaker Representation Learning.
Journal:
Neural Networks: The Official Journal of the International Neural Network Society
Published Date:
Mar 27, 2025
Abstract
Target speaker extraction aims to recover the speech of a specific speaker from a mixture of multiple voices. The conventional approach exploits target speaker embeddings extracted from a pre-recorded speech segment as auxiliary information, providing a prior for extraction. However, a naive single-vector embedding may overlook subtle acoustic features in the auxiliary speech, such as pitch and harmonic distribution, leading to unsatisfying performance. Furthermore, traditional speaker embeddings are trained with a speaker verification objective and do not leverage the semantics of the auxiliary speech, which may facilitate extraction. To address these challenges, we propose a simple yet effective Hierarchical Speaker Representation Learning (HSRL) method. The proposed method comprises three modules: a Local Speaker Feature Extractor (LSFE), a Global Speaker Feature Extractor (GSFE), and a Hierarchical Cascading Input Strategy (HCIS). Specifically, the LSFE exploits the fine-grained acoustic information in the anchor speech. In the GSFE, we use ECAPA-TDNN to obtain speaker embeddings of the target speaker, enhancing extraction performance with this global speaker information. In addition, a novel HCIS is proposed to integrate the output of the LSFE module into the input of the GSFE, enabling the global speaker features to attend to the semantic content of the pre-recorded speech. Experimental results on the Libri-2talker dataset demonstrate that HSRL achieves significant performance improvements and establishes a new state of the art.
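To make the described data flow concrete, below is a minimal PyTorch sketch of the hierarchical cascade the abstract outlines: an LSFE producing frame-level (local) features from the anchor speech, a GSFE producing a single utterance-level (global) embedding, and an HCIS-style connection that feeds the LSFE output into the GSFE input. All module internals, layer choices, and dimensions here are illustrative assumptions, not the authors' implementation; in particular, the paper's GSFE is ECAPA-TDNN, for which a single TDNN layer with statistics pooling stands in here.

```python
# Illustrative sketch of the HSRL cascade (assumed architecture, not the paper's code).
import torch
import torch.nn as nn

class LSFE(nn.Module):
    """Local Speaker Feature Extractor: fine-grained, frame-level features."""
    def __init__(self, in_dim=80, hid_dim=256):
        super().__init__()
        self.net = nn.Conv1d(in_dim, hid_dim, kernel_size=3, padding=1)

    def forward(self, mel):                   # mel: (B, in_dim, T)
        return torch.relu(self.net(mel))      # (B, hid_dim, T) local features

class GSFE(nn.Module):
    """Global Speaker Feature Extractor: one utterance-level embedding.
    The paper uses ECAPA-TDNN; a TDNN layer + statistics pooling stands in."""
    def __init__(self, in_dim=256, emb_dim=192):
        super().__init__()
        self.tdnn = nn.Conv1d(in_dim, 512, kernel_size=5, padding=2)
        self.proj = nn.Linear(512 * 2, emb_dim)

    def forward(self, x):                     # x: (B, in_dim, T)
        h = torch.relu(self.tdnn(x))
        # Statistics pooling: mean and std over time, as in x-vector/ECAPA systems.
        stats = torch.cat([h.mean(dim=-1), h.std(dim=-1)], dim=-1)
        return self.proj(stats)               # (B, emb_dim) global embedding

class HSRL(nn.Module):
    """HCIS: cascade the LSFE output into the GSFE input, so the global
    embedding is computed from (and can attend to) the local features."""
    def __init__(self):
        super().__init__()
        self.lsfe = LSFE()
        self.gsfe = GSFE()

    def forward(self, anchor_mel):
        local = self.lsfe(anchor_mel)         # fine-grained local features
        global_emb = self.gsfe(local)         # global embedding conditioned on them
        return local, global_emb              # both would condition the extractor

if __name__ == "__main__":
    mel = torch.randn(2, 80, 200)             # batch of 2 anchor utterances
    local, emb = HSRL()(mel)
    print(local.shape, emb.shape)             # (2, 256, 200) and (2, 192)
```

In an actual extraction system, both outputs would condition a downstream separation network on the mixture; how they are injected (e.g., concatenation or feature-wise modulation) is not specified by the abstract and is left abstract here.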