Speech as a Multimodal Digital Phenotype for Multi-Task LLM-based Mental Health Prediction
Journal:
arXiv
Published Date:
May 28, 2025
Abstract
Speech is a noninvasive digital phenotype that can offer valuable insights
into mental health conditions, but it is often treated as a single modality. In
contrast, we propose treating patient speech data as a trimodal
multimedia data source for depression detection. This study explores the
potential of large language model-based architectures for speech-based
depression prediction in a multimodal regime that integrates speech-derived
text, acoustic landmarks, and vocal biomarkers. Adolescent depression presents
a significant challenge and is often comorbid with multiple disorders, such as
suicidal ideation and sleep disturbances. This comorbidity offers an
opportunity to integrate multi-task learning (MTL) into our study by
simultaneously predicting depression, suicidal ideation, and sleep disturbances
using the multimodal formulation. We also propose a longitudinal analysis
strategy that models temporal changes across multiple clinical interactions,
allowing for a comprehensive understanding of the conditions' progression. Our
proposed approach, featuring trimodal, longitudinal MTL, is evaluated on the
Depression Early Warning dataset. It achieves a balanced accuracy of 70.8%,
which is higher than that of each of the unimodal, single-task, and
non-longitudinal methods.
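
The headline metric above is balanced accuracy. As a reminder of what that number measures, here is a minimal sketch assuming the standard definition (the mean of per-class recall); the abstract does not state how the authors computed it. The function name `balanced_accuracy` and the toy labels are illustrative only and are not taken from the paper or its dataset.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall (standard balanced accuracy)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    # Recall is computed for each class that appears in y_true, then averaged.
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy labels (1 = positive class, 0 = negative class).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

score = balanced_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"balanced accuracy: {score:.2f}, plain accuracy: {plain_accuracy:.2f}")
```

On imbalanced labels like these, plain accuracy is dominated by the majority class, whereas balanced accuracy weights each class equally. That property is a common reason to prefer it in screening tasks where positive cases are rare.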