Mixture of Experts for Recognizing Depression from Interview and Reading Tasks
Journal: arXiv
Published Date: Feb 27, 2025
Abstract
Depression is a mental disorder that can cause a variety of symptoms,
including psychological, physical, and social ones. Speech has been shown to be
an objective marker for the early recognition of depression. For this reason,
many studies have aimed to recognize depression through speech.
However, existing methods rely only on spontaneous speech and
neglect the information available in read speech, use transcripts that are
often difficult to obtain (manual) or come with high word-error rates
(automatic), and do not consider input-conditional computation methods. To
address these limitations, we present the first study in depression recognition
that obtains representations of both spontaneous and read speech, utilizes
multimodal fusion methods, and employs Mixture of Experts (MoE) models in a
single deep neural network. Specifically, we use audio files corresponding to
both interview and reading tasks and convert each audio file into a log-Mel
spectrogram with its delta and delta-delta features. Next, the image
representations of the two tasks pass through shared AlexNet models. The
outputs of the AlexNet models are
given as input to a multimodal fusion method. The resulting vector is passed
through an MoE module. In this study, we employ three variants of MoE, namely
sparsely-gated MoE and multilinear MoE based on factorization. Findings suggest
that our proposed approach yields an accuracy of 87.00% and an F1-score of
86.66% on the Androids corpus.
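
To make two of the steps above concrete, the following is a minimal NumPy sketch, not the authors' implementation. It shows (1) how a log-Mel spectrogram can be stacked with its delta and delta-delta into a three-channel image, and (2) a simplified sparsely-gated MoE layer with top-k gating. Everything named here is an assumption made for illustration: the function names, the delta window width, the single tanh layer used as each expert, the number of experts, and the value of k. The abstract does not specify these details, and the noise term often added to the gate during training is omitted.

```python
import numpy as np

def delta(feat, width=2):
    """Regression-based delta along the time axis (last axis), edge-padded."""
    n_frames = feat.shape[-1]
    pad = [(0, 0)] * (feat.ndim - 1) + [(width, width)]
    padded = np.pad(feat, pad, mode="edge")
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(feat, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[..., width + n: width + n + n_frames]
        behind = padded[..., width - n: width - n + n_frames]
        out += n * (ahead - behind)
    return out / denom

def three_channel_image(log_mel):
    """Stack log-Mel, delta, and delta-delta as channels: (3, n_mels, n_frames)."""
    d1 = delta(log_mel)
    d2 = delta(d1)
    return np.stack([log_mel, d1, d2], axis=0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def top_k_gates(x, w_gate, k):
    """Keep the k largest gate logits per row; softmax over those only."""
    logits = x @ w_gate                                   # (batch, n_experts)
    top_idx = np.argpartition(logits, -k, axis=-1)[:, -k:]
    masked = np.full_like(logits, -np.inf)
    kept = np.take_along_axis(logits, top_idx, axis=-1)
    np.put_along_axis(masked, top_idx, kept, axis=-1)
    return softmax(masked, axis=-1)                       # zeros outside top-k

def sparse_moe(x, w_gate, expert_weights, k=2):
    """Weighted sum of the k selected experts for each input row.

    expert_weights has shape (n_experts, d_in, d_out); each expert is a
    single tanh layer here (an illustrative assumption).
    """
    gates = top_k_gates(x, w_gate, k)
    out = np.zeros((x.shape[0], expert_weights.shape[2]))
    for e in range(expert_weights.shape[0]):
        rows = np.nonzero(gates[:, e])[0]                 # rows routed to expert e
        if rows.size:
            hidden = np.tanh(x[rows] @ expert_weights[e])
            out[rows] += gates[rows, e:e + 1] * hidden
    return out, gates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_log_mel = rng.normal(size=(64, 100))             # placeholder features
    print(three_channel_image(fake_log_mel).shape)
    fused = rng.normal(size=(5, 8))                       # placeholder fused vectors
    y, g = sparse_moe(fused, rng.normal(size=(8, 4)), rng.normal(size=(4, 8, 3)))
    print(y.shape, (g > 0).sum(axis=1))
```

Only the experts selected by the gate are evaluated for a given input, which is what makes the computation input-conditional: different fused vectors can be routed to different experts.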