Multimodal Alzheimer's disease recognition from image, text and audio.

Journal: Scientific Reports
Published Date:

Abstract

Alzheimer's disease (AD) is a progressive neurodegenerative disorder that significantly affects cognitive function. One widely used diagnostic approach involves analyzing patients' verbal descriptions of pictures. While prior studies have primarily focused on speech- and text-based models, the integration of visual context is still at an early stage. This study proposes a novel multimodal AD prediction model that integrates image, text, and audio modalities. The image and text modalities are processed using a vision-language model and structured as a bipartite graph before fusion, while all three modalities are integrated through a combination of co-attention-based intermediate fusion and late fusion, enabling effective inter-modality cooperation. The proposed model achieves an accuracy of 90.61%, outperforming state-of-the-art models. Furthermore, an ablation study quantifies the contribution of each modality using Shapley values, which serve as the foundation for a novel auxiliary loss function that adaptively adjusts modality importance during training. The findings indicate that integrating image, text, and audio modalities via a co-attention-based intermediate fusion enhances AD classification performance. Additionally, this study analyzes modality-specific attention patterns and key linguistic tokens, demonstrating that audio and text provide complementary cues for AD classification.
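
The Shapley-value attribution mentioned in the abstract can be illustrated with a minimal sketch. It assumes the "value" of a modality subset is the classification accuracy obtained with only those modalities, which may differ from the value function the authors actually use. The function name `shapley_values` and every number in `toy_accuracy` are hypothetical placeholders chosen for illustration; none of them are results from the paper.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a small set of players.

    `value` maps a frozenset of players to a score (here: accuracy).
    Exact enumeration is cheap for three modalities (8 subsets).
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(len(others) + 1):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal gain from adding modality p to coalition s.
                total += weight * (value[s | {p}] - value[s])
        phi[p] = total
    return phi


MODALITIES = ("image", "text", "audio")

# HYPOTHETICAL accuracies per modality subset (placeholders, not paper results).
toy_accuracy = {
    frozenset(): 0.50,                      # chance level, binary task
    frozenset({"image"}): 0.55,
    frozenset({"text"}): 0.80,
    frozenset({"audio"}): 0.75,
    frozenset({"image", "text"}): 0.84,
    frozenset({"image", "audio"}): 0.78,
    frozenset({"text", "audio"}): 0.88,
    frozenset(MODALITIES): 0.90,
}

phi = shapley_values(MODALITIES, toy_accuracy)
for name, contribution in phi.items():
    print(f"{name}: {contribution:.3f}")
```

By the efficiency property of Shapley values, the three contributions sum to the gap between the full-model score and the empty-set baseline, so each value can be read as that modality's share of the total gain. How these shares would feed an auxiliary loss is not specified in the abstract and is not modeled here.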

Authors

  • Byounghwa Lee
    Integrated Intelligence Research Section, Electronics and Telecommunications Research Institute, Daejeon, 34129, Republic of Korea. byounghwa.lee@etri.re.kr.
  • Hwa Jeon Song
    Integrated Intelligence Research Section, Electronics and Telecommunications Research Institute, Daejeon, 34129, Republic of Korea.
  • Young-Jin Park
    Electro-Medicine Device Research Division, Korea Electrotechnology Research Institute, Ansan, 15588, Republic of Korea.
  • Byung Ok Kang
    Integrated Intelligence Research Section, Electronics and Telecommunications Research Institute, Daejeon, 34129, Republic of Korea.