An enhanced deep learning approach for speaker diarization using TitaNet, MarbelNet and time delay network.

Journal: Scientific Reports
Published Date:

Abstract

Speaker diarization, the task of identifying "who spoke when," plays a vital role in speech transcription, supervised fine-tuning of large language models, conversational AI, and audio content analysis by providing labeled speaker segments. Traditional speaker diarization methods, including clustering-based approaches, struggle with noise, overlapping speech, and speaker variability, and suffer from high missed detection rates, all of which reduce accuracy and robustness. This study presents a deep learning framework for speaker diarization, the Neuro-TM Diarizer, whose name derives from Neural Tita-Net and Marble-Net Diarizer. It integrates noise reduction, adaptive beamforming, and neural diarization to enhance diarization performance in complex acoustic environments. The proposed multimodal framework uses Marble-Net for voice activity detection and Tita-Net for generating speaker embeddings, followed by neural diarization using time-delay neural networks for speaker identification. We evaluate the proposed approach on two standard datasets, VoxConverse and VoxCeleb, comparing clustering-based methods with the proposed Neuro-TM Diarizer using three metrics: Diarization Error Rate (DER), false alarm rate, and missed detection rate. The empirical results indicate that the proposed method outperforms clustering-based approaches, achieving 6.89% and 6.93% DER on the VoxConverse and VoxCeleb datasets, respectively. Additionally, the Neuro-TM Diarizer improved DER by 12.60% on VoxConverse and 14.01% on VoxCeleb compared to clustering-based approaches. The proposed framework contributes to real-world applications in speech transcription, speaker authentication, and audio archiving.
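
To make the three evaluation metrics concrete, the following is a minimal sketch of how DER is commonly defined: the sum of false alarm, missed detection, and speaker confusion time, divided by the total reference speech time. It is not the evaluation code used in the study. The function name `frame_level_der` and the frame-label representation are assumptions made for illustration, and the sketch further assumes one speaker per frame (no overlapping speech), no forgiveness collar, and hypothesis labels that have already been mapped to reference speaker labels.

```python
from typing import Dict, Optional, Sequence


def frame_level_der(
    reference: Sequence[Optional[str]],
    hypothesis: Sequence[Optional[str]],
) -> Dict[str, float]:
    """Compute simplified frame-level diarization metrics.

    Each sequence holds one label per fixed-length frame: a speaker
    name for speech, or None for non-speech. Hypothetical helper;
    assumes hypothesis labels are already aligned to reference labels.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")

    false_alarm = missed = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1            # frame counts toward reference speech time
            if hyp is None:
                missed += 1        # speech present but not detected
            elif hyp != ref:
                confusion += 1     # speech detected, wrong speaker
        elif hyp is not None:
            false_alarm += 1       # non-speech labeled as speech

    if speech == 0:
        raise ValueError("reference contains no speech frames")

    return {
        "false_alarm_rate": false_alarm / speech,
        "missed_detection_rate": missed / speech,
        "confusion_rate": confusion / speech,
        "der": (false_alarm + missed + confusion) / speech,
    }


# Toy example: 10 frames, 8 of which are reference speech.
reference = ["A", "A", "A", "A", None, None, "B", "B", "B", "B"]
hypothesis = ["A", "A", "B", None, "A", None, "B", "B", "B", "B"]
result = frame_level_der(reference, hypothesis)
print(result["der"])  # 0.375: one confusion, one miss, one false alarm over 8 speech frames
```

Because the denominator is reference speech time rather than total audio time, the three component rates add up to the DER, and a system with many false alarms can in principle exceed 100% DER.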

Authors

  • Muzamil Ahmed
    Department of Computer Science, Namal University, Mianwali, 42210, Pakistan.
  • Riad Alharbey
    Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia.
  • Ali Daud
    Faculty of Resilience, Rabdan Academy, Abu Dhabi, United Arab Emirates. alimsdb@gmail.com.
  • Hikmat Ullah Khan
    Department of Information Technology, University of Sargodha, Sargodha, Punjab, Pakistan. dr.hikmat.niazi@gmail.com.
  • Javeria Nawal
    Department of Computer Science, Namal University, Mianwali, 42210, Pakistan.
  • Ghazia Arshad
    Department of Computer Science, Namal University, Mianwali, 42210, Pakistan.