Audio-visual source separation with localization and individual control.

Journal: PLOS ONE

Published Date: Jan 1, 2025

Abstract

The growing reliance on video conferencing software brings significant benefits but also introduces challenges, particularly in managing audio quality. In multi-participant settings, ambient noise and interruptions can hinder speaker recognition and disrupt the flow of conversation. This work proposes an audio-visual source separation pipeline designed specifically for video conferencing and telepresence robot applications. The framework aims to isolate and enhance the speech of individual participants in noisy environments while enabling control over the volume of specific individuals captured in the video frame. The proposed pipeline comprises four key components: a deep learning-based feature extractor for audio and video, an audio-guided visual attention mechanism, a module for background noise suppression and human voice separation, and Deep Multi-Resolution Network (DMRN) modules. For human voice separation, DPRNN-TasNet, a robust deep neural network framework, is employed. Experimental results demonstrate that the methodology effectively isolates and amplifies individual participants' speech, achieving a test accuracy of 71.88% on both the AVE and Music 21 datasets.
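The individual volume control described above can be illustrated with a minimal sketch. It assumes the separation stage has already produced one waveform per participant; the function name `remix`, the array layout, and the decibel-based gain interface are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np

def remix(sources, gains_db):
    """Mix separated tracks with a per-speaker gain given in decibels.

    sources:  array of shape (n_speakers, n_samples), one row per separated voice
              (assumed to come from an upstream separation model)
    gains_db: one gain per speaker in dB (0 leaves that speaker unchanged)
    """
    sources = np.asarray(sources, dtype=np.float64)
    gains_db = np.asarray(gains_db, dtype=np.float64)
    if sources.ndim != 2 or gains_db.shape != (sources.shape[0],):
        raise ValueError("expected one gain per separated source")
    # Convert dB to linear amplitude factors, e.g. 20 dB -> 10x.
    gains = 10.0 ** (gains_db / 20.0)
    # Scale each speaker's track and sum them into a single output signal.
    mix = (gains[:, None] * sources).sum(axis=0)
    # Rescale only if the mix would clip outside [-1, 1].
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix

# Two toy "speakers", two samples each.
separated = np.array([[0.1, 0.2],
                      [0.3, -0.1]])
unchanged = remix(separated, [0.0, 0.0])   # plain sum of both tracks
boosted = remix(separated, [20.0, 0.0])    # first speaker amplified 10x
```

In this sketch, raising one participant's gain changes only that participant's contribution to the mix, and the final rescaling keeps the output within the usual [-1, 1] range for floating-point audio.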

Authors

Mohanaprasad Kothandaraman
School of Electronics Engineering (SENSE), Vellore Institute of Technology, Chennai, India.

Balakrishnan Ramalingam
Engineering Product Development Pillar, Singapore University of Technology and Design (SUTD), Singapore 487372, Singapore.

Kai Sheng
School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China.

Aman Verma
School of Electronics Engineering (SENSE), Vellore Institute of Technology, Chennai, India.

Utkarsh Dhagat
School of Electronics Engineering (SENSE), Vellore Institute of Technology, Chennai, India.

Pranav Parab
School of Electronics Engineering (SENSE), Vellore Institute of Technology, Chennai, India.

Siddhartha Mallavolu
School of Electronics Engineering (SENSE), Vellore Institute of Technology, Chennai, India.

Sankar Ganesh
School of Electronics Engineering (SENSE), Vellore Institute of Technology, Chennai, India.

Keywords

Deep Learning; Humans; Neural Networks, Computer; Noise; Speech; Videoconferencing; Voice

External Resources

PubMed ID: 40408322
