A WaveNet-based model for predicting the electroglottographic signal from the acoustic voice signal.

Journal: The Journal of the Acoustical Society of America
PMID:

Abstract

The electroglottographic (EGG) signal offers a non-invasive approach to analyzing phonation. It is known, if not obvious, that the onset of vocal fold contacting has a substantial effect on how the vocal folds vibrate and on the quality of the voice. Given that the presence or absence of vocal fold contacting also has major consequences for the interpretation of acoustic metrics, it is compelling to consider the possibility of predicting EGG signals directly from the microphone speech signal. This retrospective study presents a neural network model for EGG signal estimation that uses a WaveNet architecture augmented with a self-attention mechanism. The model was trained on an existing dataset of recordings that comprehensively cover the participants' full voice range. The proposed model effectively captures the temporal dynamics and morphological characteristics of normophonic EGG waveforms, producing outputs that closely resemble the ground truth in terms of EGG waveshape and extracted EGG metrics. For evaluation, voice mapping was used to compare the distributions of metrics extracted from predicted and ground-truth EGG waveforms. The model estimates EGG signals accurately in areas of stable and contacting voicing but is less accurate in transitional and breathy phonatory conditions.
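
As a rough illustration of what "a WaveNet architecture" usually refers to, the sketch below implements the building block from the original WaveNet design: a dilated causal convolution followed by a gated activation and a residual connection. It is not the authors' model. The abstract does not give layer counts, kernel sizes, dilation schedules, or whether causality is retained, and the self-attention component is omitted here. All function names, weights, and the dilation list are hypothetical and chosen only for the example.

```python
import math


def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution with dilation and implicit left zero-padding.

    y[t] = sum_k weights[k] * x[t - (K - 1 - k) * dilation],
    where samples before the start of the signal count as zero.
    """
    k_size = len(weights)
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - (k_size - 1 - k) * dilation
            if idx >= 0:  # indices before the signal start contribute nothing
                acc += w * x[idx]
        y.append(acc)
    return y


def gated_residual_layer(x, filter_w, gate_w, dilation):
    """WaveNet-style layer: tanh(filter) * sigmoid(gate), plus the input."""
    f = causal_dilated_conv(x, filter_w, dilation)
    g = causal_dilated_conv(x, gate_w, dilation)
    z = [math.tanh(a) / (1.0 + math.exp(-b)) for a, b in zip(f, g)]
    return [xi + zi for xi, zi in zip(x, z)]


def wavenet_stack(x, dilations, filter_w, gate_w):
    """Apply one gated residual layer per dilation (weights shared for brevity)."""
    for d in dilations:
        x = gated_residual_layer(x, filter_w, gate_w, d)
    return x


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    # Hypothetical toy "microphone" signal; a trained model would learn the weights.
    mic = [math.sin(0.3 * n) for n in range(32)]
    out = wavenet_stack(mic, [1, 2, 4, 8], [0.5, -0.25], [0.1, 0.2])
    print(len(out), receptive_field(2, [1, 2, 4, 8]))
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth while the parameter count grows only linearly, which is the usual reason for choosing this design for sample-level waveform modelling.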

Authors

  • Huanchen Cai
    Division of Speech, Music and Hearing, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden.
  • Sten Ternström
    Division of Speech, Music and Hearing, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden.