Pre-gating and contextual attention gate - A new fusion method for multi-modal data tasks.

Journal: Neural Networks: the official journal of the International Neural Network Society
Published Date:

Abstract

Multi-modal representation learning has received significant attention across diverse research domains because it can model a scenario more comprehensively than any single modality. Learning cross-modal interactions is essential for combining multi-modal data into a joint representation. However, conventional cross-attention mechanisms can produce noisy, non-meaningful values when no useful cross-modal interactions exist among the input features, thereby introducing uncertainty into the feature representation. These factors can degrade the performance of downstream tasks. This paper introduces a novel Pre-gating and Contextual Attention Gate (PCAG) module for multi-modal learning, comprising two gating mechanisms that operate at distinct information-processing levels within the deep learning model. The first gate filters out interactions that are uninformative for the downstream task, while the second gate reduces the uncertainty introduced by the cross-attention module. Experimental results on eight multi-modal classification tasks spanning various domains show that the multi-modal fusion model with PCAG outperforms state-of-the-art multi-modal fusion models. Additionally, we elucidate how PCAG effectively processes cross-modality interactions.
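The abstract does not give implementation details, so the following is a minimal PyTorch sketch of the two-gate idea as it is described: a first gate that down-weights uninformative cross-modal interactions before cross-attention, and a second gate that damps the cross-attention output conditioned on the original context features. All module names (`PreGate`, `ContextualAttentionGate`, `PCAGFusion`), tensor shapes, and gating formulas here are illustrative assumptions, not the authors' actual design.

```python
# Illustrative sketch only; the paper's actual PCAG architecture may differ.
import torch
import torch.nn as nn


class PreGate(nn.Module):
    """Hypothetical first gate: scores each position so that cross-modal
    interactions uninformative for the downstream task are suppressed."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 1), nn.Sigmoid(),
        )

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # Per-position gate in [0, 1], computed jointly from both modalities.
        return self.score(torch.cat([x_a, x_b], dim=-1))


class ContextualAttentionGate(nn.Module):
    """Hypothetical second gate: damps uncertain values produced by
    cross-attention, conditioned on the original (context) features."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, attended: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([attended, context], dim=-1))
        return g * attended


class PCAGFusion(nn.Module):
    """Cross-attention fusion wrapped by the two gates."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.pre_gate = PreGate(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ca_gate = ContextualAttentionGate(dim)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a, x_b: (batch, seq, dim) features from two modalities.
        g = self.pre_gate(x_a, x_b)                 # filter weak interactions
        gated_b = g * x_b
        attended, _ = self.cross_attn(query=x_a, key=gated_b, value=gated_b)
        return self.ca_gate(attended, x_a)          # damp uncertain outputs


if __name__ == "__main__":
    a = torch.randn(2, 16, 64)  # e.g., text features
    b = torch.randn(2, 16, 64)  # e.g., image features
    print(PCAGFusion(dim=64)(a, b).shape)  # torch.Size([2, 16, 64])
```

The placement of the two gates mirrors the abstract's description of gating at "distinct information processing levels": the pre-gate acts on raw input features before any interaction is computed, while the contextual gate acts on the cross-attention output.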

Authors

  • Duoyi Zhang
    Centre for Data Science, School of Computer Science, Queensland University of Technology, 4000, Brisbane, Australia. Electronic address: duoyi.zhang@hdr.qut.edu.au.
  • Richi Nayak
  • Md Abul Bashar
    Centre for Data Science, School of Computer Science, Queensland University of Technology, 4000, Brisbane, Australia. Electronic address: m1.bashar@qut.edu.au.