On the localness modeling for the self-attention based end-to-end speech synthesis.

Journal: Neural networks : the official journal of the International Neural Network Society
Published Date:

Abstract

Attention-based end-to-end speech synthesis achieves better performance in both prosody and quality than the conventional "front-end"-"back-end" structure. However, training such an end-to-end framework is usually time-consuming because of the use of recurrent neural networks. To enable parallel computation and long-range dependency modeling, a solely self-attention based framework named Transformer was recently proposed within the end-to-end family. However, self-attention lacks position information in sequential modeling, so extra position representations are crucial for good performance. Moreover, the weighted sum in self-attention is computed over the whole input sequence when producing a latent representation, which may disperse attention across the entire sequence rather than focusing on the more important neighboring input states, resulting in generation errors. In this paper, we introduce two localness modeling methods to enhance the self-attention based representation for speech synthesis; they retain the parallel computation and global-range dependency modeling of self-attention while improving generation stability. We systematically analyze the solely self-attention based end-to-end speech synthesis framework and unveil the importance of local context. We then add the proposed relative-position-aware method to enhance local edges and experiment with different architectures to examine the effectiveness of localness modeling. To obtain a query-specific window and discard the hyper-parameter of the relative-position-aware approach, we further introduce a Gaussian-based bias to enhance localness. Experimental results indicate that the two proposed localness-enhanced methods both improve the performance of the self-attention model, especially when applied to the encoder. Furthermore, the query-specific window of the Gaussian bias approach is more robust than the fixed relative edges.
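The Gaussian-based localness bias described above can be illustrated with a short sketch: an additive term penalizes attention logits by the squared distance between the query and key positions, so nearby states dominate the weighted sum while distant states remain reachable. The NumPy sketch below is based only on the abstract's description and is not the authors' exact formulation; the function name and the fixed deviation sigma are assumptions for brevity (in the paper the window is query-specific rather than fixed).

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gaussian_biased_attention(Q, K, V, sigma=3.0):
    """Scaled dot-product self-attention with an additive Gaussian localness bias.

    Each query position i receives a bias -(j - i)^2 / (2 * sigma^2) on its
    logit toward key position j.  Note: sigma is fixed here for illustration;
    a query-specific window would predict it from each query vector.
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)                  # (T, T) content-based scores
    pos = np.arange(Q.shape[0])
    dist2 = (pos[:, None] - pos[None, :]) ** 2     # squared distance |i - j|^2
    bias = -dist2 / (2.0 * sigma ** 2)             # Gaussian localness bias
    weights = softmax(logits + bias, axis=-1)      # attention concentrated near i
    return weights @ V


# Toy usage on a random "sequence" of 6 frames with 8-dimensional states.
T, d = 6, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d))
out = gaussian_biased_attention(x, x, x, sigma=2.0)
print(out.shape)  # (6, 8)
```

A smaller sigma sharpens the window around each query, while a very large sigma recovers ordinary global self-attention, which matches the abstract's claim that localness modeling can be added without giving up global-range dependency modeling.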

Authors

  • Shan Yang
Audio, Speech and Language Processing Group (ASLP@NPU), National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, Northwestern Polytechnical University, Xi'an, China.
  • Heng Lu
    Tencent AI Lab, China. Electronic address: bearlu@tencent.com.
  • Shiyin Kang
    Tencent AI Lab, China. Electronic address: shiyinkang@tencent.com.
  • Liumeng Xue
    Audio, Speech and Language Processing Group (ASLP@NPU), National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, Northwestern Polytechnical University, Xi'an, China. Electronic address: lmxue@nwpu-aslp.org.
  • Jinba Xiao
    Audio, Speech and Language Processing Group (ASLP@NPU), National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, Northwestern Polytechnical University, Xi'an, China. Electronic address: usar@npu-aslp.org.
  • Dan Su
    Tencent AI Lab, China. Electronic address: dansu@tencent.com.
  • Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, Northwestern Polytechnical University, Xi'an, China.
  • Dong Yu
Tencent AI Lab, United States.