Multi-modal long document classification based on Hierarchical Prompt and Multi-modal Transformer.

Journal: Neural Networks: The Official Journal of the International Neural Network Society

Abstract

In the realm of long document classification (LDC), previous research has predominantly focused on modeling unimodal text, overlooking the potential of multi-modal documents that incorporate images. To address this gap, we introduce an approach to multi-modal long document classification based on the Hierarchical Prompt and Multi-modal Transformer (HPMT). The proposed HPMT method facilitates multi-modal interactions at both the section and sentence levels, enabling a comprehensive capture of the hierarchical structural features and complex multi-modal associations of long documents. Specifically, a Multi-scale Multi-modal Transformer (MsMMT) is tailored to capture the multi-granularity correlations between sentences and images. This is achieved by applying multi-scale convolutional kernels to sentence features, enhancing the model's ability to discern intricate patterns. Furthermore, to facilitate cross-level information interaction and promote the learning of level-specific features, we introduce a Hierarchical Prompt (HierPrompt) block. This block incorporates section-level prompts and sentence-level prompts, both derived from a global prompt via distinct projection networks. Extensive experiments are conducted on four challenging multi-modal long document datasets. The results demonstrate that the proposed method consistently outperforms existing techniques.
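The two components described above can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions, the single-linear-map "projection networks", and the mean-filter stand-in for learned convolutional kernels are all simplifying assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 16        # feature dimension (hypothetical)
n_sents = 10  # sentences in one section (hypothetical)

# HierPrompt idea: one global prompt shared across levels...
global_prompt = rng.standard_normal(d)

# ...from which distinct projection networks (here reduced to single
# linear maps) derive section-level and sentence-level prompts.
W_section = rng.standard_normal((d, d)) / np.sqrt(d)
W_sentence = rng.standard_normal((d, d)) / np.sqrt(d)
section_prompt = W_section @ global_prompt
sentence_prompt = W_sentence @ global_prompt

def multi_scale_conv(x, kernel_sizes=(1, 3, 5)):
    """MsMMT-style multi-scale aggregation over the sentence axis:
    run 1D filters of several widths and average the results.
    A width-k mean filter stands in for a learned kernel."""
    outs = []
    for k in kernel_sizes:
        pad = k // 2
        xp = np.pad(x, ((pad, pad), (0, 0)))  # same-length output
        out = np.stack([xp[i:i + k].mean(axis=0) for i in range(len(x))])
        outs.append(out)
    return np.mean(outs, axis=0)

# Sentence features for one section, with the sentence-level prompt
# prepended before multi-scale aggregation.
sent_feats = rng.standard_normal((n_sents, d))
augmented = np.vstack([sentence_prompt, sent_feats])
multi_scale_feats = multi_scale_conv(augmented)
print(multi_scale_feats.shape)  # one multi-granularity feature per row
```

The sketch only shows the hierarchical-prompt derivation and the multi-scale aggregation; the paper's cross-modal attention between sentence/section features and image features is omitted.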

Authors

  • Tengfei Liu
    Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China.
  • Yongli Hu
    Institute for Infocomm Research, A*STAR, 1 Fusionopolis Way, #21-01 Connexis (South Tower), Singapore, Singapore. huy@i2r.a-star.edu.sg.
  • Junbin Gao
  • Jiapu Wang
    Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China.
  • Yanfeng Sun
Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China. Electronic address: yfsun@bjut.edu.cn.
  • Baocai Yin
    iFLYTEK Research, Hefei, China.