Deep Speech Synthesis from Multimodal Articulatory Representations

Journal: arXiv

Published Date: Dec 17, 2024

Abstract

Far less articulatory data than acoustic speech data is available for training deep learning models. To improve articulatory-to-acoustic synthesis performance in these low-resource settings, we propose a multimodal pre-training framework. On single-speaker speech synthesis tasks from real-time magnetic resonance imaging (MRI) and surface electromyography inputs, the intelligibility of synthesized outputs improves noticeably. For example, compared to prior work, our proposed transfer learning methods improve MRI-to-speech performance by 36% in word error rate. In addition to these intelligibility results, our multimodal pre-trained models consistently outperform unimodal baselines on three objective and subjective synthesis quality metrics.

Authors

Peter Wu
Bohan Yu
Kevin Scheck
Alan W Black
Aditi S. Krishnapriyan
Irene Y. Chen
Tanja Schultz
Shinji Watanabe
Gopala K. Anumanchipalli

External Resources

View on arXiv (http://arxiv.org/abs/2412.13387v1)
