A Multimodal Attention Fusion-Based Model for Pathological Voice Detection.

Journal: Biomedical physics & engineering express

Abstract

Pathological voice detection provides a non-invasive and cost-effective approach for the early screening of laryngeal disorders. However, most existing methods rely on single-modal acoustic features and focus mainly on binary classification, which limits their ability to discriminate finely among multiple pathological voice categories. To address these limitations, this paper proposes a Multimodal Fusion Network (MFNet) with attention-based fusion for pathological voice classification. The proposed framework jointly models time-domain and frequency-domain information within a unified architecture. Specifically, SincNet is employed to extract task-relevant narrowband acoustic features directly from raw speech waveforms, while the Tunable Q-factor Wavelet Transform (TQWT) and SKNet are combined to learn multi-scale spectral representations from Mel-spectrograms. In addition, an Attentional Feature Fusion (AFF) module is introduced to adaptively integrate features from the two branches, thereby enhancing the discriminative capability of pathological voice representations. Comparative experiments on three public pathological voice datasets show that the proposed method outperforms representative baseline models and demonstrates strong robustness and generalization across datasets. Ablation studies further validate the effectiveness of the proposed architectural design. These results indicate that MFNet provides an effective solution for pathological voice classification and has promising potential for computer-aided voice disorder screening.
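
The attention-based fusion of the two branches can be illustrated with a minimal sketch. It assumes the commonly used AFF form, in which a gate w computed from the sum of the two feature vectors weights one branch by w and the other by 1 - w. The abstract does not specify the attention module, the feature dimensions, or the learned weights, so the small random bottleneck gate, the function names (make_gate, attentional_fusion), and the example vectors below are hypothetical placeholders rather than the authors' implementation.

```python
import math
import random


def sigmoid(v):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def linear(weights, vec):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def make_gate(dim, hidden, seed=0):
    """Build a toy attention gate: dim -> hidden (ReLU) -> dim (sigmoid).

    The weights are random stand-ins for parameters that would be learned
    during training (an assumption made only for this illustration).
    """
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]

    def gate(vec):
        h = [max(0.0, v) for v in linear(w1, vec)]  # ReLU bottleneck
        return [sigmoid(v) for v in linear(w2, h)]  # per-channel weight in (0, 1)

    return gate


def attentional_fusion(time_feat, freq_feat, gate):
    """Fuse two equal-length feature vectors as w * x + (1 - w) * y."""
    summed = [a + b for a, b in zip(time_feat, freq_feat)]
    w = gate(summed)  # attention weights computed from the combined features
    return [wi * a + (1.0 - wi) * b
            for wi, a, b in zip(w, time_feat, freq_feat)]


# Hypothetical 4-channel outputs of the waveform and spectrogram branches.
time_feat = [0.2, -1.0, 0.7, 0.4]
freq_feat = [0.9, 0.3, -0.5, 0.4]
gate = make_gate(dim=4, hidden=2)
fused = attentional_fusion(time_feat, freq_feat, gate)
print(fused)
```

Because each weight lies strictly between 0 and 1, every fused channel is a convex combination of the two branch values, so the gate decides per channel how much to trust the time-domain versus the frequency-domain representation.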
