Exploring refined dual visual features cross-combination for image captioning.

Journal: Neural Networks: The Official Journal of the International Neural Network Society
PMID:

Abstract

Transformer-based encoders have become commonplace in current image captioning tasks for encoding region and grid features: thanks to the multi-head self-attention mechanism, the encoder can better capture the relationships between different regions of an image and their contextual information. However, stacking Transformer blocks requires computing self-attention over the visual features at quadratic cost, which not only produces many redundant features but also significantly increases computational overhead. This paper presents a novel Distilled Cross-Combination Transformer (DCCT) network. Technically, we first introduce a distillation cascade fusion encoder (DCFE), in which a probabilistic sparse self-attention layer filters out redundant and distracting features that blur attention focus, yielding more refined visual features and more efficient encoding. Next, we develop a parallel cross-fusion attention module (PCFA) that fully exploits the complementarity and correlation between grid and region features to better fuse the two encoded streams of visual features. Extensive experiments on the MSCOCO dataset demonstrate that the proposed DCCT achieves outstanding performance, rivaling current state-of-the-art approaches.
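The abstract names the two mechanisms but not their implementation. The sketch below is a minimal illustration under two assumptions: that the "probabilistic sparse self-attention" resembles Informer-style ProbSparse query selection (keep only the queries whose attention distribution deviates most from uniform), and that the parallel cross-fusion step is plain scaled dot-product cross-attention between grid and region features. All function names, feature shapes, and the mean-style fusion are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def probsparse_filter(q, k, d, u):
    # Informer-style sparsity measurement (an assumption, not the paper's
    # exact layer): M(q_i) = max_j(q_i.k_j/sqrt(d)) - mean_j(q_i.k_j/sqrt(d)).
    # Queries with small M attend near-uniformly and are treated as
    # redundant; keep only the u most "focused" queries.
    scores = q @ k.T / np.sqrt(d)
    m = scores.max(axis=1) - scores.mean(axis=1)
    return np.argsort(m)[-u:]

def cross_attention(queries, keys_values, d):
    # Scaled dot-product cross-attention: queries come from one visual
    # stream, keys/values from the other.
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

rng = np.random.default_rng(0)
d = 64
grid = rng.standard_normal((49, d))    # hypothetical 7x7 grid features
region = rng.standard_normal((36, d))  # hypothetical 36 region features

# Distillation step: drop low-information grid queries before attention.
kept = probsparse_filter(grid, grid, d, u=25)
grid_refined = grid[kept]

# Parallel cross-fusion: each stream attends to the other, and the two
# attended streams are concatenated (a simple stand-in for PCFA's fusion).
g2r = cross_attention(grid_refined, region, d)
r2g = cross_attention(region, grid_refined, d)
fused = np.concatenate([g2r, r2g], axis=0)
print(fused.shape)  # (61, 64)
```

The sparse filter reduces the quadratic attention cost by shrinking the query set before the expensive `queries @ keys.T` product, which is the efficiency argument the abstract makes.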

Authors

  • Junbo Hu
    Department of Pathology, Maternal and Child Hospital of Hubei Province, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China. cqjbhu@163.com.
  • Zhixin Li
    School of Microelectronics and Control Engineering, Changzhou University, Changzhou 213000, China.
  • Qiang Su
    Guizhou University of Traditional Chinese Medicine, Guiyang, Guizhou Province, China.
  • Zhenjun Tang
    Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China; Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin 541004, China. Electronic address: zjtang@gxnu.edu.cn.
  • Huifang Ma