Depression detection methods based on multimodal fusion of voice and text.
Journal:
Scientific Reports
Published Date:
Jul 1, 2025
Abstract
Depression is a prevalent mental health disorder, and early detection is crucial for timely intervention. Traditional diagnostic methods often rely on subjective clinical judgment, leading to variability and inefficiency. This study proposes a fusion model for automated depression detection that leverages bimodal voice and text data. The pre-trained Wav2Vec 2.0 and BERT models were used for feature extraction, while a multi-scale convolutional layer and a Bi-LSTM network were employed for feature fusion and classification. Adaptive pooling was used to integrate features, enabling simultaneous depression classification and PHQ-8 severity estimation within a unified system. Experiments on the CMDC and DAIC datasets demonstrate the model's effectiveness. On CMDC, the F1 score improved by 0.0103 and 0.2017 over the voice-only and text-only models, respectively, while RMSE decreased by 0.5186. On DAIC, the F1 score increased by 0.0645 and 0.2589, and RMSE was reduced by 1.9901. These results highlight the proposed method's ability to capture and integrate multi-level information across modalities, significantly improving the accuracy and reliability of automated depression detection and severity prediction.
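The abstract does not specify implementation details, but the pipeline it describes (Wav2Vec 2.0 and BERT encoders, multi-scale convolution, a Bi-LSTM, adaptive pooling, and joint classification/PHQ-8 regression heads) can be sketched in PyTorch. The sketch below is illustrative only, not the authors' implementation: the checkpoint names (facebook/wav2vec2-base-960h, bert-base-uncased), the channel and hidden sizes, the frozen encoders, and the choice to fuse modalities by concatenating their multi-scale feature sequences along the time axis before the Bi-LSTM are all assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel, Wav2Vec2Model


class BimodalDepressionNet(nn.Module):
    """Hypothetical sketch of the bimodal pipeline described in the abstract.

    All layer sizes, checkpoints, and the fusion strategy are assumptions;
    the abstract only names the components, not their configuration.
    """

    def __init__(self, conv_channels=128, kernel_sizes=(3, 5, 7), lstm_hidden=128):
        super().__init__()
        # Pre-trained feature extractors, frozen here to keep the sketch light.
        self.audio_enc = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.text_enc = BertModel.from_pretrained("bert-base-uncased")
        for p in list(self.audio_enc.parameters()) + list(self.text_enc.parameters()):
            p.requires_grad = False

        # Multi-scale 1-D convolutions over each modality's frame/token features.
        def branch(in_dim):
            return nn.ModuleList(
                [nn.Conv1d(in_dim, conv_channels, k, padding=k // 2) for k in kernel_sizes]
            )

        self.audio_convs = branch(self.audio_enc.config.hidden_size)  # 768 for the base model
        self.text_convs = branch(self.text_enc.config.hidden_size)    # 768 for BERT-base

        fused_dim = conv_channels * len(kernel_sizes)
        # Bi-LSTM over the time-concatenated multi-scale features of both modalities.
        self.bilstm = nn.LSTM(fused_dim, lstm_hidden, batch_first=True, bidirectional=True)
        # Adaptive pooling integrates variable-length sequences into a fixed vector.
        self.pool = nn.AdaptiveAvgPool1d(1)

        self.cls_head = nn.Linear(2 * lstm_hidden, 2)  # depressed vs. non-depressed
        self.reg_head = nn.Linear(2 * lstm_hidden, 1)  # PHQ-8 severity score

    @staticmethod
    def _multi_scale(x, convs):
        # x: (B, T, D) -> (B, T, conv_channels * n_kernels); odd kernels keep T fixed.
        x = x.transpose(1, 2)                            # (B, D, T)
        feats = [torch.relu(conv(x)) for conv in convs]  # each (B, C, T)
        return torch.cat(feats, dim=1).transpose(1, 2)

    def forward(self, waveform, input_ids, attention_mask):
        # waveform: (B, num_samples) raw audio; input_ids/attention_mask: tokenized text.
        a = self.audio_enc(waveform).last_hidden_state
        t = self.text_enc(input_ids, attention_mask=attention_mask).last_hidden_state
        a = self._multi_scale(a, self.audio_convs)
        t = self._multi_scale(t, self.text_convs)
        seq, _ = self.bilstm(torch.cat([a, t], dim=1))         # fuse along the time axis
        z = self.pool(seq.transpose(1, 2)).squeeze(-1)         # (B, 2 * lstm_hidden)
        return self.cls_head(z), self.reg_head(z).squeeze(-1)  # logits, PHQ-8 estimate
```

In this sketch the two task heads share the pooled representation, so the binary label and the PHQ-8 score come from one forward pass, mirroring the "unified system" the abstract describes; whether the original model fine-tunes the encoders or weights the two losses differently is not stated in the abstract.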