CMDF-TTS: Text-to-speech method with limited target speaker corpus.

Journal: Neural Networks: The Official Journal of the International Neural Network Society
Published Date:

Abstract

Although end-to-end Text-to-Speech (TTS) methods that work with a limited target speaker corpus can generate high-quality speech, they often require a non-target speaker corpus (auxiliary corpus) containing a substantial number of pairs to train the model, which significantly increases training costs. In this work, we propose a fast, high-quality speech synthesis approach that requires only a few target speaker recordings. Using a statistical analysis of the role of phonemes, function words, and utterance target domains in the corpus, we propose the Statistical-based Compression Auxiliary Corpus (SCAC) algorithm. SCAC significantly improves model training speed without a noticeable decrease in speech naturalness. Next, we use the compressed corpus to train the proposed non-autoregressive model CMDF-TTS, which uses a multi-level prosody modeling module to obtain more information and Denoising Diffusion Probabilistic Models (DDPMs) to generate mel-spectrograms. In addition, we fine-tune the model on the target speaker corpus to embed the speaker's characteristics into the model, and use a Conditional Variational Auto-Encoder Generative Adversarial Network (CVAE-GAN) to further enhance the quality of the synthesized speech. Experimental results on multiple Mandarin and English corpora demonstrate that the CMDF-TTS model, enhanced by the SCAC algorithm, effectively balances training speed and synthesized speech quality. Overall, its performance surpasses that of state-of-the-art models.

Authors

  • Ye Tao
    Department of Gastroenterology, The First Affiliated Hospital of Zhejiang Chinese Medical University, Hangzhou, China.
  • Jiawang Liu
    Department of Chemistry, Xavier University of Louisiana, New Orleans, LA 70125, USA.
  • Chaofeng Lu
    School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, PR China. Electronic address: 2579356425@qq.com.
  • Meng Liu
  • Xiugong Qin
    Beijing Research Institute of Automation for Machinery Industry Co., Ltd., Beijing 100000, PR China. Electronic address: 13121990213@163.com.
  • Yunlong Tian
    National Engineering Research Center of Digital Home Networking, Qingdao 266000, PR China. Electronic address: tianyl@haier.com.
  • Yongjie Du
    National Engineering Research Center of Digital Home Networking, Qingdao 266000, PR China. Electronic address: tcldu@163.com.