Convolutional fusion network for monaural speech enhancement.

Journal: Neural Networks: the official journal of the International Neural Network Society

Abstract

Convolutional neural network (CNN) based methods, such as the convolutional encoder-decoder network, offer state-of-the-art results in monaural speech enhancement. In a conventional encoder-decoder network, a large kernel size is often used to increase the model capacity, which, however, results in low parameter efficiency. This can be addressed with group convolution, as in AlexNet, where group convolutions are performed in parallel in each layer before their outputs are concatenated. However, with simple concatenation, inter-channel dependency information may be lost. To address this, the Shuffle network re-arranges the outputs of each group before concatenating them, with each group of convolutions taking only part of the whole input sequence as its input. In this work, we propose a new convolutional fusion network (CFN) for monaural speech enhancement that improves model performance, inter-channel dependency, information reuse, and parameter efficiency. First, a new group convolutional fusion unit (GCFU), consisting of a standard CNN and a depth-wise separable CNN, is used to reconstruct the signal. Second, the whole input sequence (full information) is fed simultaneously to two convolution networks in parallel, and their outputs are re-arranged (shuffled) and then concatenated, to exploit inter-channel dependencies within the network. Third, an intra-skip connection mechanism connects different layers inside the encoder as well as the decoder to further improve model performance. Extensive experiments demonstrate the improved performance of the proposed method compared with three recent baseline methods.
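
As a concrete illustration of the building blocks described in the abstract (group convolution, channel shuffle, and depth-wise separable convolution), the following is a minimal PyTorch sketch, assuming a spectrogram-like 2-D input. The class name GCFU follows the abstract, but the kernel sizes, channel counts, and PReLU activation are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch of a group convolutional fusion unit (GCFU): two parallel branches
# (a standard convolution and a depth-wise separable convolution) both see
# the full input; their outputs are concatenated and channel-shuffled.
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Re-arrange channels across groups (as in ShuffleNet)."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # interleave the groups
    return x.view(n, c, h, w)


class GCFU(nn.Module):
    """Two parallel branches over the full input, shuffled and concatenated."""

    def __init__(self, in_ch: int, branch_ch: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Branch 1: standard convolution.
        self.standard = nn.Conv2d(in_ch, branch_ch, kernel_size, padding=pad)
        # Branch 2: depth-wise separable convolution
        # (per-channel depth-wise conv followed by a 1x1 point-wise conv).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=pad, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.standard(x)                   # full input to branch 1
        b = self.pointwise(self.depthwise(x))  # full input to branch 2
        out = torch.cat([a, b], dim=1)         # concatenate branch outputs
        out = channel_shuffle(out, groups=2)   # mix inter-channel information
        return self.act(out)


if __name__ == "__main__":
    # e.g. a batch of spectrogram patches: (batch, channels, time, freq)
    x = torch.randn(4, 16, 64, 64)
    y = GCFU(in_ch=16, branch_ch=32)(x)
    print(y.shape)  # torch.Size([4, 64, 64, 64])
```

Because both branches receive the full input and the concatenated output is shuffled, channels produced by the two branches are interleaved before the next layer, which is the mechanism the abstract credits with preserving inter-channel dependency.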

Authors

  • Yang Xian
    Intelligent Sensing and Communications Research Group, School of Engineering, Newcastle University, Newcastle upon Tyne NE1 7RU, UK; College of Computer and Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou, China. Electronic address: Y.xian2@newcastle.ac.uk.
  • Yang Sun
    Department of Gastroenterology, First Affiliated Hospital of Kunming Medical University, Kunming, China.
  • Wenwu Wang
    Centre for Vision, Speech and Signal Processing, Department of Electrical and Electronic Engineering, University of Surrey, Surrey GU2 7XH, UK. Electronic address: W.wang@surrey.ac.uk.
  • Syed Mohsen Naqvi
    Intelligent Sensing and Communications Research Group, School of Engineering, Newcastle University, Newcastle upon Tyne NE1 7RU, UK. Electronic address: Mohsen.naqvi@newcastle.ac.uk.