An enhanced deep learning approach for speaker diarization using TitaNet, MarbelNet and time delay network.
Journal:
Scientific Reports
Published Date:
Jul 8, 2025
Abstract
Speaker diarization, the task of identifying "who spoke when," plays a vital role in speech transcription, supervised fine-tuning of large language models, conversational AI, and audio content analysis by providing labeled speaker segments. Traditional speaker diarization methods, including clustering-based approaches, struggle with noise, overlapping speech, and speaker variability, and suffer from high missed detection rates, all of which degrade accuracy and robustness. This study presents a deep learning framework for speaker diarization, the Neuro-TM Diarizer (Neural TitaNet and MarbleNet Diarizer). It integrates noise reduction, adaptive beamforming, and neural diarization to improve performance in complex acoustic environments. The proposed multimodal framework uses MarbleNet for voice activity detection and TitaNet for generating speaker embeddings, followed by neural diarization using time-delay neural networks for speaker identification. We evaluate the proposed approach on two standard datasets, VoxConverse and VoxCeleb, comparing clustering-based methods with the Neuro-TM Diarizer using three metrics: diarization error rate (DER), false alarm rate, and missed detection rate. The empirical results indicate that the proposed method outperforms clustering-based approaches, achieving DERs of 6.89% and 6.93% on the VoxConverse and VoxCeleb datasets, respectively. Additionally, the Neuro-TM Diarizer improved DER by 12.60% on VoxConverse and 14.01% on VoxCeleb compared to clustering-based approaches. The proposed framework contributes to real-world applications in speech transcription, speaker authentication, and audio archiving.
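The three evaluation metrics named in the abstract are commonly computed as ratios of error durations to total reference speech time, with DER summing false alarm, missed detection, and speaker confusion time. The sketch below illustrates that standard definition only; the class name, function names, and sample durations are hypothetical, and the abstract does not state the authors' exact scoring setup (for example, collar size or overlap handling).

```python
from dataclasses import dataclass


@dataclass
class ErrorDurations:
    """Error durations in seconds for one scored recording (hypothetical container)."""
    false_alarm: float    # non-speech labeled as speech
    missed: float         # reference speech that was not detected
    confusion: float      # speech attributed to the wrong speaker
    total_speech: float   # total reference speech time


def _check(e: ErrorDurations) -> None:
    # A ratio over zero or negative reference speech time is undefined.
    if e.total_speech <= 0:
        raise ValueError("total_speech must be positive")


def false_alarm_rate(e: ErrorDurations) -> float:
    _check(e)
    return e.false_alarm / e.total_speech


def missed_detection_rate(e: ErrorDurations) -> float:
    _check(e)
    return e.missed / e.total_speech


def diarization_error_rate(e: ErrorDurations) -> float:
    # Standard definition: (false alarm + missed + confusion) / total speech.
    _check(e)
    return (e.false_alarm + e.missed + e.confusion) / e.total_speech


if __name__ == "__main__":
    # Illustrative durations only, not values from the paper.
    sample = ErrorDurations(false_alarm=2.0, missed=3.0, confusion=1.0,
                            total_speech=100.0)
    print(f"DER: {diarization_error_rate(sample):.2%}")
    print(f"False alarm rate: {false_alarm_rate(sample):.2%}")
    print(f"Missed detection rate: {missed_detection_rate(sample):.2%}")
```

Because DER includes speaker confusion in addition to the other two error types, it is always at least as large as the sum of the false alarm and missed detection rates for the same recording.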