Vision-referential speech enhancement of an audio signal using mask information captured as visual data.

Journal: The Journal of the Acoustical Society of America

Published Date: Jan 1, 2019

Abstract

This paper describes vision-referential speech enhancement of an audio signal using mask information captured as visual data. Smartphones and tablet devices have become popular in recent years, and most of them have a camera as well as a microphone. Although the frame rate of the camera in such devices is very low compared to the sampling rate of the audio signal from the microphone, the camera signal can help enhance the speech signal if both signals are used appropriately. In the proposed method, the speaker broadcasts not only his/her speech signal through a loudspeaker but also its mask information through a display. The receiver enhances the speech by combining the speech signal captured by the microphone with the reference signal captured by the camera. Experiments were conducted to evaluate the effectiveness of the proposed method compared to a typical sparse approach. They confirmed that the speech could be enhanced even in environments with different kinds of noise and a high level of real noise. Experiments were also conducted to check the sound quality of the proposed method, comparing its output to clean audio data compressed in MP3 format at various bit rates. The sound quality was sufficient for practical application.
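The abstract does not state how the mask is computed, encoded on the display, or decoded from the camera, so the following Python sketch is only an illustration of the general idea of mask-based enhancement, not the author's implementation. It assumes a time-frequency mask that marks where the sender's own speech is strong, and it mimics the low camera frame rate by updating the mask only once every few audio frames. All function names, the threshold, the hold length, and the toy signal are hypothetical choices made for this example.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Short-time Fourier transform with a Hann window; returns (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def speech_mask(clean_spec, rel_threshold=0.1):
    """Assumed sender-side mask: 1 where the speech magnitude exceeds a
    fraction of its global maximum, 0 elsewhere (hypothetical rule)."""
    mag = np.abs(clean_spec)
    return (mag > rel_threshold * mag.max()).astype(float)

def hold_mask(mask, hold):
    """Mimic a slow visual channel: keep every `hold`-th mask frame and
    repeat it until the next update arrives."""
    idx = (np.arange(mask.shape[0]) // hold) * hold
    return mask[idx]

def snr_db(reference_spec, estimate_spec):
    """Signal-to-noise ratio of an estimate against a reference, in dB."""
    err = reference_spec - estimate_spec
    return 10 * np.log10(np.sum(np.abs(reference_spec) ** 2)
                         / np.sum(np.abs(err) ** 2))

def demo(hold=4, seed=0):
    """Return (SNR before masking, SNR after masking) for a toy signal."""
    fs = 8000
    t = np.arange(fs) / fs
    # Toy stand-in for speech: a 1 kHz tone gated on and off.
    envelope = (np.sin(2 * np.pi * 2 * t) > 0).astype(float)
    speech = envelope * np.sin(2 * np.pi * 1000 * t)
    rng = np.random.default_rng(seed)
    noise = 0.5 * rng.standard_normal(len(t))

    clean_spec = stft(speech)
    noisy_spec = stft(speech + noise)          # what the microphone captures
    mask = hold_mask(speech_mask(clean_spec), hold)  # what the camera captures
    enhanced_spec = mask * noisy_spec          # receiver-side enhancement
    return snr_db(clean_spec, noisy_spec), snr_db(clean_spec, enhanced_spec)

if __name__ == "__main__":
    before, after = demo()
    print(f"SNR before masking: {before:.1f} dB, after masking: {after:.1f} dB")
```

In this toy setting the mask removes noise in every time-frequency cell where the speech is absent, so the coarse, held mask still improves the signal-to-noise ratio even though it lags the audio at speech onsets and offsets.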

Authors

Mitsuharu Matsumoto

Department of Informatics, University of Electro-Communications, 1-5-1, Chofugaoka, Chofu-shi, Tokyo, 182-8585, Japan.

Keywords

Adult; Female; Humans; Image Processing, Computer-Assisted; Male; Natural Language Processing; Signal-To-Noise Ratio; Speech Perception; Speech Recognition Software

External Resources

PubMed ID: 30710939
