Endpoint-aware audio-visual speech enhancement utilizing dynamic weight modulation based on SNR estimation.

Journal: Neural Networks: The Official Journal of the International Neural Network Society

PMID: 39874821

Abstract

Integrating visual features has proven effective for deep learning-based speech quality enhancement, particularly in highly noisy environments. However, such models may suffer from redundant information, which degrades performance when the signal-to-noise ratio (SNR) is relatively high. Real-world noisy scenarios also typically exhibit widely varying noise levels. To address these issues, this study proposes a novel Audio-Visual Speech Enhancement (AVSE) system that incorporates audio and visual voice activity information. Using attention techniques driven by an SNR estimation module, the system dynamically adjusts the weights of the audio and visual endpoint information during evaluation according to the environmental noise level. This dynamic modulation makes the model an Endpoint-Aware Network (EANet). The model prioritizes the desired voice period, thereby enhancing speech intelligibility by jointly leveraging noisy acoustic cues and noise-robust visual cues. Experiments are conducted on benchmark datasets. The results indicate that EANet effectively integrates audio and visual information and outperforms the audio-only model, especially in scenarios with wide SNR ranges. This work thus demonstrates more effective fusion of multimodal information for AVSE, enhancing both the quality and the intelligibility of speech.
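
The idea of shifting weight between audio and visual endpoint cues according to an SNR estimate can be illustrated with a minimal sketch. This is not the EANet architecture, which the abstract describes only as attention-based. It is a simplified stand-in that assumes a scalar SNR estimate in dB, per-frame voice activity probabilities for each modality, and a sigmoid gate. The function names and the `midpoint_db` and `slope_db` parameters are hypothetical.

```python
import math


def modality_weights(snr_db, midpoint_db=0.0, slope_db=5.0):
    """Map an SNR estimate (dB) to (audio_weight, visual_weight).

    Assumption: a sigmoid gate. High SNR favors the audio cue,
    low SNR favors the noise-robust visual cue. Weights sum to 1.
    """
    audio_w = 1.0 / (1.0 + math.exp(-(snr_db - midpoint_db) / slope_db))
    return audio_w, 1.0 - audio_w


def fuse_endpoint_probs(audio_vad, visual_vad, snr_db):
    """Blend per-frame voice activity probabilities from both modalities."""
    if len(audio_vad) != len(visual_vad):
        raise ValueError("audio and visual sequences must have equal length")
    audio_w, visual_w = modality_weights(snr_db)
    return [audio_w * a + visual_w * v for a, v in zip(audio_vad, visual_vad)]


if __name__ == "__main__":
    # Hypothetical per-frame probabilities, for illustration only.
    audio = [0.9, 0.2, 0.8]
    visual = [0.7, 0.6, 0.1]
    for snr in (-10.0, 0.0, 20.0):
        print(snr, fuse_endpoint_probs(audio, visual, snr))
```

In this sketch the fused endpoint estimate follows the visual stream in heavy noise and the audio stream when the signal is clean, which mirrors the motivation stated in the abstract: visual information helps most at low SNR and can become redundant at high SNR.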

Authors

Zhehui Zhu
School of Automotive Studies, Tongji University, Shanghai 201804, China. Electronic address: 2131577@tongji.edu.cn.

Lijun Zhang
Department of Paediatric Orthopaedics, Shengjing Hospital of China Medical University, Shenyang, Liaoning Province, China.

Kaikun Pei
School of Automotive Studies, Tongji University, Shanghai 201804, China.

Siqi Chen
College of Animal Science and Technology, Jilin Agricultural University, Changchun, China.

Keywords

Deep Learning; Humans; Neural Networks, Computer; Noise; Signal-To-Noise Ratio; Speech; Speech Intelligibility; Speech Perception

