Towards Robust Assessment of Pathological Voices via Combined Low-Level Descriptors and Foundation Model Representations
Journal: arXiv
Published Date: May 27, 2025
Abstract
Perceptual voice quality assessment is essential for diagnosing and
monitoring voice disorders by providing standardized evaluations of vocal
function. Traditionally, expert raters use standard scales such as the
Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) and Grade,
Roughness, Breathiness, Asthenia, and Strain (GRBAS). However, these ratings
are subjective and prone to inter-rater variability, which motivates
automated, objective assessment methods. This study proposes the Voice Quality
Assessment Network (VOQANet), a deep learning-based framework with an attention
mechanism that leverages a Speech Foundation Model (SFM) to extract high-level
acoustic and prosodic information from raw speech. To enhance robustness and
interpretability, we also introduce VOQANet+, which integrates low-level speech
descriptors such as jitter, shimmer, and harmonics-to-noise ratio (HNR) with
SFM embeddings into a hybrid representation. Unlike prior studies that focus
only on vowel-based phonation (the PVQD-A subset of the Perceptual Voice
Quality Dataset, PVQD), we evaluate our models on both vowel-based and
sentence-level speech (the PVQD-S subset) to improve generalizability. Results show that
sentence-based input outperforms vowel-based input, especially at the patient
level, underscoring the value of longer utterances for capturing perceptual
voice attributes. VOQANet consistently surpasses baseline methods in root mean
squared error (RMSE) and Pearson correlation coefficient (PCC) across CAPE-V
and GRBAS dimensions, with VOQANet+ achieving even better performance.
Additional experiments under noisy conditions show that VOQANet+ maintains high
prediction accuracy and robustness, supporting its potential for real-world and
telehealth deployment.
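
As a minimal sketch of the two evaluation metrics named above, the following Python computes RMSE and PCC between predicted and expert-rated scores. The function names are illustrative and not taken from the paper. The patient-level step assumes that scores are averaged over each patient's utterances, which the abstract does not specify.

```python
import math
from collections import defaultdict


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length score lists."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pcc(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length score lists.

    Raises ZeroDivisionError if either list is constant (zero variance).
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def patient_level(patient_ids, scores):
    """Average utterance-level scores per patient (assumed aggregation).

    Returns one mean score per patient, ordered by sorted patient ID.
    """
    groups = defaultdict(list)
    for pid, score in zip(patient_ids, scores):
        groups[pid].append(score)
    return [sum(groups[pid]) / len(groups[pid]) for pid in sorted(groups)]
```

To compare utterance-level and patient-level results under this assumption, apply `patient_level` to both the predicted and the reference scores with the same patient IDs, then pass the two aggregated lists to `rmse` and `pcc`.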