Enhancing target speaker extraction with Hierarchical Speaker Representation Learning.
Journal:
Neural Networks: The Official Journal of the International Neural Network Society
Published Date:
Mar 27, 2025
Abstract
Target speaker extraction aims to recover the speech of a specific speaker from a mixture of multiple voices. The conventional approach exploits target speaker embeddings extracted from a pre-recorded speech segment as auxiliary information, providing a prior for extraction. However, a naive single-vector embedding may overlook subtle acoustic features in the auxiliary speech, such as pitch and harmonic distribution, leading to unsatisfying performance. Furthermore, traditional speaker embeddings are trained with a speaker verification objective and do not leverage the semantics of the auxiliary speech, which may facilitate extraction. To address these challenges, we propose a simple yet effective Hierarchical Speaker Representation Learning (HSRL) method. The proposed method comprises three modules: a Local Speaker Feature Extractor (LSFE), a Global Speaker Feature Extractor (GSFE), and a Hierarchical Cascading Input Strategy (HCIS). Specifically, the LSFE exploits the fine-grained acoustic information in the anchor speech. In the GSFE, we use ECAPA-TDNN to obtain speaker embeddings of the target speaker, enhancing extraction performance with this global speaker information. In addition, a novel HCIS is proposed to integrate the output of the LSFE module into the input of the GSFE, enabling the global speaker features to attend to the semantic content of the pre-recorded speech. Experimental results on the Libri-2talker dataset demonstrate that HSRL achieves significant performance improvements and establishes a new state of the art.
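To make the described data flow concrete, below is a minimal PyTorch sketch of the hierarchical cascade the abstract outlines: an LSFE producing frame-level (local) features from the anchor speech, a GSFE producing a single utterance-level (global) embedding, and an HCIS-style connection that feeds the LSFE output into the GSFE input. All module internals, layer choices, and dimensions here are illustrative assumptions, not the authors' implementation; in particular, the paper's GSFE is ECAPA-TDNN, for which a single TDNN layer with statistics pooling stands in here.

```python
# Illustrative sketch of the HSRL cascade (assumed architecture, not the paper's code).
import torch
import torch.nn as nn

class LSFE(nn.Module):
    """Local Speaker Feature Extractor: fine-grained, frame-level features."""
    def __init__(self, in_dim=80, hid_dim=256):
        super().__init__()
        self.net = nn.Conv1d(in_dim, hid_dim, kernel_size=3, padding=1)

    def forward(self, mel):                   # mel: (B, in_dim, T)
        return torch.relu(self.net(mel))      # (B, hid_dim, T) local features

class GSFE(nn.Module):
    """Global Speaker Feature Extractor: one utterance-level embedding.
    The paper uses ECAPA-TDNN; a TDNN layer + statistics pooling stands in."""
    def __init__(self, in_dim=256, emb_dim=192):
        super().__init__()
        self.tdnn = nn.Conv1d(in_dim, 512, kernel_size=5, padding=2)
        self.proj = nn.Linear(512 * 2, emb_dim)

    def forward(self, x):                     # x: (B, in_dim, T)
        h = torch.relu(self.tdnn(x))
        # Statistics pooling: mean and std over time, as in x-vector/ECAPA systems.
        stats = torch.cat([h.mean(dim=-1), h.std(dim=-1)], dim=-1)
        return self.proj(stats)               # (B, emb_dim) global embedding

class HSRL(nn.Module):
    """HCIS: cascade the LSFE output into the GSFE input, so the global
    embedding is computed from (and can attend to) the local features."""
    def __init__(self):
        super().__init__()
        self.lsfe = LSFE()
        self.gsfe = GSFE()

    def forward(self, anchor_mel):
        local = self.lsfe(anchor_mel)         # fine-grained local features
        global_emb = self.gsfe(local)         # global embedding conditioned on them
        return local, global_emb              # both would condition the extractor

if __name__ == "__main__":
    mel = torch.randn(2, 80, 200)             # batch of 2 anchor utterances
    local, emb = HSRL()(mel)
    print(local.shape, emb.shape)             # (2, 256, 200) and (2, 192)
```

In an actual extraction system, both outputs would condition a downstream separation network on the mixture; how they are injected (e.g., concatenation or feature-wise modulation) is not specified by the abstract and is left abstract here.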