Leveraging Embedding Techniques in Multimodal Machine Learning for Mental Illness Assessment
Journal: arXiv
Published Date: Apr 2, 2025
Abstract
The increasing global prevalence of mental disorders, such as depression and
PTSD, requires objective and scalable diagnostic tools. Traditional clinical
assessments often face limitations in accessibility, objectivity, and
consistency. This paper investigates the potential of multimodal machine
learning (MMML) to address these challenges, leveraging the complementary information
available in text, audio, and video data. Our approach involves a comprehensive
analysis of various data preprocessing techniques, including novel chunking and
utterance-based formatting strategies. We systematically evaluate a range of
state-of-the-art embedding models for each modality and employ Convolutional
Neural Networks (CNNs) and Bidirectional LSTM Networks (BiLSTMs) for feature
extraction. We explore data-level, feature-level, and decision-level fusion
techniques, including a novel integration of Large Language Model (LLM)
predictions. We also investigate the impact of replacing Multilayer Perceptron
classifiers with Support Vector Machines. We extend our analysis to severity
prediction, using PHQ-8 and PCL-C scores, and to multi-class classification
that accounts for co-occurring conditions. Our results demonstrate that
utterance-based chunking significantly improves performance, particularly for
text and audio modalities. Decision-level fusion that incorporates LLM
predictions performs best, reaching a balanced accuracy of 94.8% for
depression detection and 96.2% for PTSD detection. The combination of CNN-BiLSTM
architectures with utterance-level chunking, coupled with the integration of
an external LLM, provides a powerful and nuanced approach to the detection and
assessment of mental health conditions. Our findings highlight the potential of
MMML for developing more accurate, accessible, and personalized mental
healthcare tools.
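
To make two terms from the abstract concrete, the following is a minimal sketch of decision-level fusion and of the balanced accuracy metric. It is not the paper's implementation: the abstract does not specify the fusion rule, so a weighted average of per-modality probabilities with a 0.5 threshold is assumed here, and the function names, modality keys, and toy numbers are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def fuse_decisions(
    modality_probs: Dict[str, Sequence[float]],
    weights: Optional[Dict[str, float]] = None,
    threshold: float = 0.5,
) -> List[int]:
    """Late (decision-level) fusion by weighted averaging.

    Each entry of `modality_probs` holds one model's predicted probability
    of the positive class for every sample. Averaging is an assumed fusion
    rule chosen for illustration, not necessarily the one used in the paper.
    """
    names = list(modality_probs)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal weights by default
    total_weight = sum(weights[name] for name in names)
    n_samples = len(modality_probs[names[0]])

    fused = []
    for i in range(n_samples):
        score = sum(weights[name] * modality_probs[name][i] for name in names)
        fused.append(1 if score / total_weight >= threshold else 0)
    return fused


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall, which is robust to class imbalance."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == label]
        correct = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy, made-up probabilities for four interviews (illustration only).
toy_probs = {
    "text": [0.9, 0.2, 0.6, 0.1],
    "audio": [0.7, 0.4, 0.3, 0.2],
    "video": [0.8, 0.1, 0.4, 0.6],
    "llm": [1.0, 0.0, 1.0, 0.0],  # hard LLM votes encoded as 0/1
}
toy_labels = [1, 0, 1, 1]

fused_labels = fuse_decisions(toy_probs)
score = balanced_accuracy(toy_labels, fused_labels)
```

Balanced accuracy averages the recall of each class, so a classifier that always predicts the majority class scores only 0.5 on a binary task. This is presumably why the metric suits clinical datasets in which positive cases are the minority.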