An enhanced deep learning approach for speaker diarization using TitaNet, MarbelNet and time delay network.

Journal: Scientific Reports

Published Date: Jul 8, 2025

Abstract

Speaker diarization, the task of identifying "who spoke when," plays a vital role in speech transcription, supervised fine-tuning of large language models, conversational AI, and audio content analysis by providing labeled speaker segments. Traditional speaker diarization methods, including clustering-based approaches, struggle with noise, overlapping speech, and speaker variability, and suffer from high missed detection rates, all of which reduce accuracy and robustness. This study presents a deep learning framework for speaker diarization, the Neuro-TM Diarizer (short for Neural Tita-Net and Marble-Net Diarizer). It integrates noise reduction, adaptive beamforming, and neural diarization to enhance diarization performance in complex acoustic environments. The proposed multimodal framework uses Marble-Net for voice activity detection and Tita-Net for generating speaker embeddings, followed by neural diarization using time-delay neural networks for speaker identification. We evaluate the proposed approach on two standard datasets, VoxConverse and VoxCeleb, comparing clustering-based methods with the proposed Neuro-TM Diarizer using three metrics: Diarization Error Rate (DER), false alarm rate, and missed detection rate. The results indicate that the proposed method outperforms clustering-based approaches, achieving a DER of 6.89% on VoxConverse and 6.93% on VoxCeleb. Compared to clustering-based approaches, the Neuro-TM Diarizer improved DER by 12.60% on VoxConverse and 14.01% on VoxCeleb. The proposed framework contributes to real-world applications in speech transcription, speaker authentication, and audio archiving.
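To make the three reported metrics concrete, the following is a minimal sketch of a frame-level DER computation under the commonly used definition: missed speech plus false alarm plus speaker confusion, divided by the total reference speech. It is not the authors' scoring code. The function name `diarization_error_rate`, the per-frame label format, and the toy data are illustrative assumptions, and the sketch further assumes no overlapping speech, no forgiveness collar, and hypothesis labels that have already been mapped to reference speakers.

```python
from typing import Dict, List, Optional


def diarization_error_rate(
    reference: List[Optional[str]],
    hypothesis: List[Optional[str]],
) -> Dict[str, float]:
    """Frame-level DER sketch (hypothetical helper, not from the paper).

    Each list holds one speaker label per fixed-length frame; None marks
    non-speech. Assumes hypothesis labels are already mapped to the
    reference speaker names and that no frame has overlapping speakers.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1        # speech present, system output silence
            elif hyp != ref:
                confusion += 1     # speech attributed to the wrong speaker
        elif hyp is not None:
            false_alarm += 1       # system output speech during silence

    if speech == 0:
        raise ValueError("reference contains no speech frames")

    return {
        "missed_detection_rate": missed / speech,
        "false_alarm_rate": false_alarm / speech,
        "confusion_rate": confusion / speech,
        "der": (missed + false_alarm + confusion) / speech,
    }


# Toy example: 10 frames, 8 of which contain reference speech.
reference = ["A", "A", "A", "B", "B", None, None, "B", "B", "B"]
hypothesis = ["A", "A", "B", "B", None, None, "B", "B", "B", "B"]
scores = diarization_error_rate(reference, hypothesis)
print(scores["der"])  # one miss, one false alarm, one confusion over 8 frames
```

In this toy case each error type occurs once over eight reference speech frames, so each component rate is 0.125 and the DER is 0.375. Standard scoring tools additionally handle overlapping speech, an optional collar around segment boundaries, and an optimal speaker mapping, none of which are modeled here.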

Authors

Muzamil Ahmed
Department of Computer Science, Namal University, Mianwali, 42210, Pakistan.

Riad Alharbey
Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia.

Ali Daud
Faculty of Resilience, Rabdan Academy, Abu Dhabi, United Arab Emirates. alimsdb@gmail.com.

Hikmat Ullah Khan
Department of Information Technology, University of Sargodha, Sargodha, Punjab, Pakistan. dr.hikmat.niazi@gmail.com.

Javeria Nawal
Department of Computer Science, Namal University, Mianwali, 42210, Pakistan.

Ghazia Arshad
Department of Computer Science, Namal University, Mianwali, 42210, Pakistan.

Keywords

Algorithms; Deep Learning; Humans; Neural Networks, Computer; Speech

External Resources

PubMed (PMID: 40628926)
