Deep Speech Synthesis from Multimodal Articulatory Representations
Journal:
arXiv
Published Date:
Dec 17, 2024
Abstract
The amount of articulatory data available for training deep learning models
is much smaller than that of acoustic speech data. To improve
articulatory-to-acoustic synthesis performance in these low-resource settings,
we propose a multimodal pre-training framework. On single-speaker speech
synthesis tasks from real-time magnetic resonance imaging and surface
electromyography inputs, the intelligibility of synthesized outputs improves
noticeably. For example, compared to prior work, our proposed
transfer learning methods improve MRI-to-speech performance by 36% in word
error rate. In addition to these intelligibility results, our multimodal
pre-trained models consistently outperform unimodal baselines on three
objective and subjective synthesis quality metrics.
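
For readers unfamiliar with the intelligibility metric cited above, word error rate (WER) is conventionally defined as the word-level edit distance between a reference transcript and a recognized transcript of the synthesized audio, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in the paper. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and the abstract does not say whether the reported 36% is an absolute or a relative change.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain whitespace
    splitting, which is an assumption rather than the paper's procedure.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Lower values indicate more intelligible output. Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.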