Mandarin Speech Reconstruction from Tongue Motion Ultrasound Images based on Generative Adversarial Networks.
Journal:
Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
PMID:
40039953
Abstract
Speech impairment resulting from laryngectomy causes severe physiological and psychological distress to laryngectomees. In clinical practice, the articulatory organs of the upper vocal tract function normally in most laryngectomees. Reconstructing speech from articulatory information could therefore contribute meaningfully to speech rehabilitation in these patients. We first created a Mandarin corpus by simultaneously recording dynamic tongue motion ultrasound images and speech waveforms during the experiment. We then used an autoencoder to extract deep representations from the ultrasound images. Building on these representations, we established a speech waveform generation model using generative adversarial networks, and conducted both objective and subjective evaluations to assess the quality of the reconstructed speech. The results show that the phoneme accuracy of the reconstructed speech reaches 72.43%, and the accuracy of Mandarin tones is 76.10%. The mel-spectrogram and fundamental frequency contour of the reconstructed speech show a high degree of similarity to those of the original speech. Additionally, subjective perception tests of the reconstructed speech affirm its acceptability (mean opinion score > 6). The method presented in this paper enables the reconstruction of tonal Mandarin speech from dynamic tongue motion ultrasound images. However, future research should focus on the specific conditions of laryngectomees, improving and optimizing model performance, expanding training datasets, and enhancing the quality of reconstructed speech.
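The abstract reports a phoneme accuracy figure but does not say how it was computed. As an illustration only, the sketch below assumes a common convention: accuracy is one minus the edit distance between the reference and recognized phoneme sequences, divided by the reference length. The function names and the example phoneme sequences are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (lists of strings)."""
    # prev[j] holds the distance between the first i-1 reference phonemes
    # and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_accuracy(ref, hyp):
    """Assumed definition: 1 - edit_distance / reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substituted phoneme out of four.
reference = ["n", "i", "h", "ao"]
recognized = ["n", "i", "h", "a"]
print(phoneme_accuracy(reference, recognized))
```

Under this assumed definition, insertions and deletions are penalized as well as substitutions, so the score can fall below zero when the recognized sequence is much longer than the reference.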