Vision-referential speech enhancement of an audio signal using mask information captured as visual data.
Journal:
The Journal of the Acoustical Society of America
Published Date:
Jan 1, 2019
Abstract
This paper describes vision-referential speech enhancement of an audio signal using mask information captured as visual data. Smartphones and tablet devices have become popular in recent years, and most of them have a camera as well as a microphone. Although the frame rate of the camera in such devices is very low compared with the sampling rate of the audio signal from the microphone, the camera signal can help enhance the speech signal if both signals are used appropriately. In the proposed method, the speaker broadcasts not only his/her speech signal through a loudspeaker but also its mask information through a display. The receiver can enhance the speech by combining the speech signal captured by the microphone with the reference signal captured by the camera. Experiments were conducted to evaluate the effectiveness of the proposed method against a typical sparse approach. They confirmed that the speech could be enhanced even in environments with different kinds of noise and a high level of real noise. Experiments were also conducted to check the sound quality of the proposed method: the enhanced speech was compared with clean audio compressed in MP3 format at various bit rates. The sound quality was sufficient for practical application.
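The abstract does not say what form the mask information takes or how the receiver combines it with the microphone signal. As a rough illustration only, the sketch below assumes a binary time-frequency mask computed from the clean speech on the sender side, received at a lower frame rate than the audio frames (standing in for the camera), held constant between updates, and multiplied onto the noisy spectrogram. The function names (`stft`, `binary_mask`, `hold_mask`, `apply_mask`), the threshold, and the frame sizes are all hypothetical choices, not the paper's method.

```python
import numpy as np


def stft(x, frame_len=256, hop=128):
    """Short-time Fourier transform; returns shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def binary_mask(clean_spec, threshold_db=-30.0):
    """Assumed sender side: mark bins within threshold_db of the loudest bin."""
    mag = np.abs(clean_spec)
    level_db = 20.0 * np.log10(mag / (mag.max() + 1e-12) + 1e-12)
    return (level_db > threshold_db).astype(np.float64)


def hold_mask(mask_low_rate, factor, n_frames):
    """Assumed receiver side: repeat each low-rate mask row to cover the audio frames."""
    return np.repeat(mask_low_rate, factor, axis=0)[:n_frames]


def apply_mask(noisy_spec, mask):
    """Keep only the time-frequency bins that the mask marks as speech."""
    return noisy_spec * mask


# Toy signals: a 1 kHz tone stands in for speech, white noise for the environment.
fs = 8000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 1000 * t)
noisy = clean + 0.5 * rng.standard_normal(fs)

clean_spec = stft(clean)
noisy_spec = stft(noisy)

# The mask is sent at one quarter of the audio frame rate (hypothetical ratio).
factor = 4
mask_sent = binary_mask(clean_spec)[::factor]
mask_received = hold_mask(mask_sent, factor, noisy_spec.shape[0])
enhanced_spec = apply_mask(noisy_spec, mask_received)

# Spectral error against the clean reference, before and after masking.
err_before = np.sum(np.abs(noisy_spec - clean_spec) ** 2)
err_after = np.sum(np.abs(enhanced_spec - clean_spec) ** 2)
```

In this toy setting the mask is stationary, so holding it between low-rate updates costs nothing; with real speech, the hold interval would trade mask accuracy against the camera's frame rate.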