A Sentiment Pre-trained Text-Guided Multimodal Cross-Attention Transformer for Improved Depression Detection.
Journal:
Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
PMID:
40040039
Abstract
Depression is a widespread mental health disorder that calls for efficient automated detection methods. Because the disorder is complex, traditional single-modality approaches are less effective, which has shifted attention to multimodal analysis. Recent advances include transformer-based fusion methods, yet in depression detection their effectiveness is often limited by the dominance of the text modality. To address this, we propose the Text-Guided Multimodal Cross-Attention Transformer, which enhances cross-modal interactions among text, audio, and video for more effective depression detection. Our approach pre-trains the encoders on a large sentiment dataset so that they better capture the emotion-related features crucial for identifying depression-related sentiment changes. On the AVEC2019 benchmark, our method outperforms current state-of-the-art depression detection techniques.
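
The abstract does not spell out the architecture, so the following is only a minimal sketch of what text-guided cross-attention could look like, not the authors' implementation. It assumes that text token features act as queries while audio and video frame features act as keys and values, uses a single attention head with no learned projections, and concatenates the results per text token. The function names (`cross_attention`, `text_guided_fusion`) and the concatenation step are hypothetical choices made for illustration, and the code is plain Python to stay dependency-free.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all key rows.

    queries: list of vectors of size d
    keys:    list of vectors of size d (same length as values)
    values:  list of vectors of any fixed size
    Returns (attended vectors, attention weights), one row per query.
    """
    d = len(queries[0])
    width = len(values[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values)) for c in range(width)])
    return outputs, weights


def text_guided_fusion(text, audio, video):
    """Hypothetical fusion: text queries pull context from audio and video.

    Assumption (not from the source): the fused feature for each text token is
    the concatenation of the token, its audio context, and its video context.
    """
    audio_ctx, _ = cross_attention(text, audio, audio)
    video_ctx, _ = cross_attention(text, video, video)
    return [t + a + v for t, a, v in zip(text, audio_ctx, video_ctx)]


if __name__ == "__main__":
    # Toy features: 2 text tokens, 3 audio frames, 4 video frames, all of size 2.
    text = [[1.0, 0.0], [0.0, 1.0]]
    audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    video = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
    fused = text_guided_fusion(text, audio, video)
    print(len(fused), len(fused[0]))
```

In this sketch the text sequence fixes the length of the fused output, which is one simple way to let text guide the other modalities. A full model would add learned query, key, and value projections, multiple heads, and the sentiment pre-trained encoders the abstract mentions.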