Leveraging Audio and Text Modalities in Mental Health: A Study of LLMs Performance
Journal: arXiv
Published Date: Dec 9, 2024
Abstract
Mental health disorders are increasingly prevalent worldwide, creating an
urgent need for innovative tools to support early diagnosis and intervention.
This study explores the potential of Large Language Models (LLMs) in multimodal
mental health diagnostics, specifically for detecting depression and
Post-Traumatic Stress Disorder (PTSD) through text and audio modalities. Using the E-DAIC
dataset, we compare text and audio modalities to investigate whether LLMs can
perform equally well or better with audio inputs than with text. We further
examine whether integrating both modalities enhances diagnostic accuracy, and
find that it generally improves performance metrics. Our analysis
uses two custom-formulated metrics, the Modal Superiority Score and the
Disagreement Resolvement Score, to evaluate how combined modalities influence
model performance. The Gemini 1.5 Pro model achieves the highest scores in
binary depression classification when using the combined modality, with an F1
score of 0.67 and a Balanced Accuracy (BA) of 77.4%, assessed across the full
dataset. These results represent an increase of 3.1% over its performance with
the text modality and 2.7% over the audio modality, highlighting the
effectiveness of integrating modalities to enhance diagnostic accuracy.
Notably, all results are obtained with zero-shot inference, which underscores the
robustness of the models without task-specific fine-tuning. To
explore the impact of different configurations on model performance, we run
binary, severity, and multiclass tasks using both zero-shot and few-shot
prompts and examine the effects of prompt variations. The results
reveal that models such as Gemini 1.5 Pro in text and audio modalities, and
GPT-4o mini in the text modality, often surpass other models in balanced
accuracy and F1 scores across multiple tasks.
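
As a reading aid for the two headline metrics reported above, the following is a minimal sketch of how F1 and Balanced Accuracy are conventionally computed for a binary task. It uses the standard textbook definitions, not code from the study. The function name `binary_metrics` and the toy labels are assumptions made purely for illustration and do not come from the E-DAIC dataset.

```python
def binary_metrics(y_true, y_pred):
    """Return (f1, balanced_accuracy) for binary labels in {0, 1}.

    Hypothetical helper for illustration; label 1 is the positive class
    (e.g. "depressed"), label 0 is the negative class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    # Balanced accuracy averages recall over the two classes, so a
    # majority-class guesser cannot score well on imbalanced data.
    balanced_accuracy = (recall + specificity) / 2
    return f1, balanced_accuracy


# Toy labels, made up for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
f1, ba = binary_metrics(y_true, y_pred)
print(f"F1 = {f1:.3f}, BA = {ba:.3f}")
```

Balanced Accuracy is a natural companion to F1 when the positive class is the minority, because it weights performance on each class equally rather than by class frequency.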