End-to-end feature fusion for jointly optimized speech enhancement and automatic speech recognition.

Journal: Scientific Reports
Published Date:

Abstract

Speech enhancement (SE) and automatic speech recognition (ASR) in real-time processing involve improving the quality and intelligibility of speech signals on the fly, ensuring accurate transcription as the speech unfolds. SE removes unwanted background noise from the target speech in high-noise environments, which is crucial for real-time ASR. This study first proposes an SE network based on an attentional codec model whose primary objective is to suppress noise in the target speech with minimal distortion. However, excessive noise suppression in the enhanced speech can diminish the effectiveness of downstream ASR systems by discarding crucial latent information. While joint SE and ASR techniques have shown promise for robust end-to-end ASR, they traditionally rely on the enhanced features alone as inputs to the ASR system. To address this limitation, our study uses a dynamic fusion approach that integrates the enhanced features with the raw noisy features, removing noise from the enhanced target speech while learning fine details from the noisy signals. This fusion mitigates speech distortions and improves the overall performance of the ASR system. The proposed model consists of an attentional codec equipped with a causal attention mechanism for SE, a GRU-based fusion network, and an ASR system. The SE network uses a modified Gated Recurrent Unit (GRU) in which the traditional hyperbolic tangent (tanh) activation is replaced by an attention-based rectified linear unit (AReLU). In the SE experiments, the proposed network consistently achieves better speech quality, intelligibility, and noise suppression than the baselines in both matched and unmatched conditions. On the LibriSpeech database, the proposed SE improves STOI by 19.81% and PESQ by 28.97% in matched conditions, and by 17.27% (STOI) and 27.51% (PESQ) in unmatched conditions. The joint training framework for robust end-to-end ASR is evaluated using the character error rate (CER). The ASR results show that joint training reduces the CER from 32.99% (averaged over the noisy signals) to 13.52% (with the proposed SE and joint training for ASR).
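The abstract specifies the modified GRU only at a high level. The PyTorch sketch below shows one plausible reading, assuming the AReLU formulation of Chen et al. (2020), in which a clamped learnable scalar scales negative inputs and a sigmoid-gated scalar amplifies positive ones; the cell swaps the candidate-state tanh for this activation. Class names and layer sizes here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AReLU(nn.Module):
    """Attention-based ReLU: negatives scaled by a clamped learnable
    alpha, positives amplified by (1 + sigmoid(beta))."""
    def __init__(self, alpha: float = 0.9, beta: float = 2.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha))
        self.beta = nn.Parameter(torch.tensor(beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        neg_scale = torch.clamp(self.alpha, 0.01, 0.99)
        pos_scale = 1.0 + torch.sigmoid(self.beta)
        return torch.where(x >= 0, pos_scale * x, neg_scale * x)

class AReLUGRUCell(nn.Module):
    """GRU cell with the candidate-state tanh replaced by AReLU,
    as the abstract describes; sizes are illustrative."""
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.gates = nn.Linear(input_size + hidden_size, 2 * hidden_size)
        self.candidate = nn.Linear(input_size + hidden_size, hidden_size)
        self.act = AReLU()

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=-1)))
        z, r = zr.chunk(2, dim=-1)                     # update / reset gates
        h_tilde = self.act(self.candidate(torch.cat([x, r * h], dim=-1)))
        return (1.0 - z) * h + z * h_tilde             # gated state update
```

Replacing tanh removes the saturation of the candidate state, which is one common motivation for using learnable rectifier activations inside recurrent layers.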
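Likewise, the dynamic fusion is described only functionally. A minimal sketch, assuming the GRU-based fusion network predicts per-frame, per-dimension sigmoid weights that convexly mix the enhanced and noisy feature streams; the GRUFusion module and its dimensions are hypothetical:

```python
class GRUFusion(nn.Module):
    """Dynamic fusion sketch: a GRU scans the concatenated enhanced and
    noisy features and emits weights that mix the two streams, so the
    ASR front end sees denoised speech plus fine detail from the raw
    signal."""
    def __init__(self, feat_dim: int, hidden_size: int = 128):
        super().__init__()
        self.gru = nn.GRU(2 * feat_dim, hidden_size, batch_first=True)
        self.proj = nn.Linear(hidden_size, feat_dim)

    def forward(self, enhanced: torch.Tensor, noisy: torch.Tensor) -> torch.Tensor:
        # enhanced, noisy: (batch, time, feat_dim)
        h, _ = self.gru(torch.cat([enhanced, noisy], dim=-1))
        w = torch.sigmoid(self.proj(h))             # fusion weights in (0, 1)
        return w * enhanced + (1.0 - w) * noisy     # per-element convex mix
```

A sigmoid-weighted convex mix keeps the fused features between the two input streams, which is consistent with the stated goal of avoiding additional speech distortion.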
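Finally, the joint training objective is not spelled out in the abstract. A typical choice, shown here purely as an assumption, is a weighted sum of the SE reconstruction loss and the ASR loss:

```python
def joint_loss(se_loss: torch.Tensor, asr_loss: torch.Tensor,
               lam: float = 0.5) -> torch.Tensor:
    # Hypothetical weighting; the paper's actual loss balance is not
    # given in the abstract.
    return lam * se_loss + (1.0 - lam) * asr_loss
```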

Authors

  • Mohamed Medani
    Department of Information Systems, Applied College at Mahayil, King Khalid University, Abha, Kingdom of Saudi Arabia.
  • Nasir Saleem
    Department of Electrical Engineering, FET, Gomal University, D.I. Khan, 29050, Pakistan.
  • Fethi Fkih
    Department of Computer Science, College of Computer, Qassim University, Buraydah, Saudi Arabia. Electronic address: f.fki@qu.edu.sa.
  • Manal Abdullah Alohali
    Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia.
  • Hela Elmannai
    Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia.
  • Sami Bourouis
    Department of Information Technology, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia.