Ranking Pretrained Speech Embeddings in Parkinson’s Disease Detection: Does Wav2Vec 2.0 Outperform its 1.0 Version Across Speech Modes and Languages?

Journal: medRxiv
Published Date:

Abstract

Speech and language technologies are effective tools for identifying the distinct speech changes associated with Parkinson’s disease (PD), enabling earlier and more accurate diagnosis. Recent advances in self-supervised speech pretraining, particularly Wav2Vec models, have outperformed traditional feature extraction methods. Although Wav2Vec 2.0 has been used successfully for PD detection, a rigorous quantitative comparison with Wav2Vec 1.0 is still needed to evaluate its advantages, limitations, and applicability across different speech modes in PD. This study presents a systematic comparison of Wav2Vec 1.0 and Wav2Vec 2.0 embeddings across three multilingual datasets, using several classification approaches to distinguish speech of healthy controls (HC) from PD speech. Both Wav2Vec versions were also benchmarked against traditional baseline features across three speech modes: spontaneous speech, non-spontaneous speech, and isolated vowels. A multicriteria TOPSIS approach was used to rank the feature extraction methods. It showed that Wav2Vec 2.0 consistently excelled across all speech modes, with its first transformer layer performing best for contextual tasks (read text and monologue) and its feature extractor performing best in vowel-based classification. Wav2Vec 1.0, while generally outperformed by Wav2Vec 2.0, still offered a faster alternative with competitive performance in contextual tasks, which highlights its potential for specific applications such as federated learning. This comparative analysis underscores the strengths of each Wav2Vec architecture and informs their optimal use in PD detection.
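
For readers unfamiliar with TOPSIS, the ranking step mentioned in the abstract can be illustrated with a minimal sketch of the standard procedure: vector normalization, weighting, and closeness to the ideal solution. This is not the authors' implementation. The function name `topsis`, the criteria (accuracy, AUC, extraction time), the weights, and all numbers below are hypothetical placeholders chosen for illustration, not results from the study.

```python
import math


def topsis(matrix, weights, benefit):
    """Return TOPSIS closeness scores (higher is better), one per row.

    matrix  : list of rows, one per alternative, one column per criterion
    weights : one weight per criterion
    benefit : True if larger values are better for that criterion, else False

    Assumes no column is all zeros and the rows are not all identical,
    otherwise a division by zero occurs.
    """
    n_criteria = len(weights)

    # Vector-normalize each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_criteria)]
    weighted = [
        [weights[j] * row[j] / norms[j] for j in range(n_criteria)]
        for row in matrix
    ]

    # Ideal and anti-ideal points, taking the criterion direction into account.
    columns = list(zip(*weighted))
    ideal = [max(col) if benefit[j] else min(col) for j, col in enumerate(columns)]
    anti = [min(col) if benefit[j] else max(col) for j, col in enumerate(columns)]

    # Relative closeness: distance to anti-ideal over the sum of both distances.
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)
        d_neg = math.dist(row, anti)
        scores.append(d_neg / (d_pos + d_neg))
    return scores


# Hypothetical alternatives and values, for illustration only.
names = ["wav2vec1", "wav2vec2_layer1", "baseline"]
matrix = [
    [0.80, 0.85, 1.0],  # accuracy, AUC, extraction time (arbitrary units)
    [0.85, 0.90, 2.0],
    [0.70, 0.75, 0.5],
]
weights = [0.4, 0.4, 0.2]       # assumed criterion weights
benefit = [True, True, False]   # extraction time is a cost criterion

scores = topsis(matrix, weights, benefit)
ranking = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The ranking depends on the chosen weights and on which criteria are treated as costs, so a speed criterion such as extraction time can move a faster method up the list even when its accuracy is lower.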

Authors

  • Ondrej Klempir; Adela Skryjova; Ales Tichopad; Radim Krupicka