Exploring refined dual visual features cross-combination for image captioning.

Journal: Neural Networks: The Official Journal of the International Neural Network Society
PMID:

Abstract

Transformer-based encoders have become commonplace in current image captioning tasks for encoding region and grid features: thanks to the multi-head self-attention mechanism, the encoder can better capture the relationships between different regions of an image and their contextual information. However, stacking Transformer blocks requires computing self-attention over the visual features at quadratic cost, which not only produces many redundant features but also significantly increases computational overhead. This paper presents a novel Distilled Cross-Combination Transformer (DCCT) network. Technically, we first introduce a distillation cascade fusion encoder (DCFE), in which a probabilistic sparse self-attention layer filters out redundant and distracting features that blur attention focus, yielding more refined visual features and more efficient encoding. Next, we develop a parallel cross-fusion attention module (PCFA) that fully exploits the complementarity and correlation between grid and region features to better fuse the two encoded streams of visual features. Extensive experiments on the MSCOCO dataset demonstrate that the proposed DCCT achieves outstanding performance, rivaling current state-of-the-art approaches.
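The abstract names the two mechanisms but not their implementation. The sketch below is a minimal illustration under two assumptions: that the "probabilistic sparse self-attention" resembles Informer-style ProbSparse query selection (keep only the queries whose attention distribution deviates most from uniform), and that the parallel cross-fusion step is plain scaled dot-product cross-attention between grid and region features. All function names, feature shapes, and the mean-style fusion are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def probsparse_filter(q, k, d, u):
    # Informer-style sparsity measurement (an assumption, not the paper's
    # exact layer): M(q_i) = max_j(q_i.k_j/sqrt(d)) - mean_j(q_i.k_j/sqrt(d)).
    # Queries with small M attend near-uniformly and are treated as
    # redundant; keep only the u most "focused" queries.
    scores = q @ k.T / np.sqrt(d)
    m = scores.max(axis=1) - scores.mean(axis=1)
    return np.argsort(m)[-u:]

def cross_attention(queries, keys_values, d):
    # Scaled dot-product cross-attention: queries come from one visual
    # stream, keys/values from the other.
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

rng = np.random.default_rng(0)
d = 64
grid = rng.standard_normal((49, d))    # hypothetical 7x7 grid features
region = rng.standard_normal((36, d))  # hypothetical 36 region features

# Distillation step: drop low-information grid queries before attention.
kept = probsparse_filter(grid, grid, d, u=25)
grid_refined = grid[kept]

# Parallel cross-fusion: each stream attends to the other, and the two
# attended streams are concatenated (a simple stand-in for PCFA's fusion).
g2r = cross_attention(grid_refined, region, d)
r2g = cross_attention(region, grid_refined, d)
fused = np.concatenate([g2r, r2g], axis=0)
print(fused.shape)  # (61, 64)
```

The sparse filter reduces the quadratic attention cost by shrinking the query set before the expensive `queries @ keys.T` product, which is the efficiency argument the abstract makes.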

Authors

  • Junbo Hu
    Department of Pathology, Maternal and Child Hospital of Hubei Province, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China. cqjbhu@163.com.
  • Zhixin Li
    School of Microelectronics and Control Engineering, Changzhou University, Changzhou 213000, China.
  • Qiang Su
    Guizhou University of Traditional Chinese Medicine, Guiyang, Guizhou Province, China.
  • Zhenjun Tang
    Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin 541004, China; Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin 541004, China. Electronic address: zjtang@gxnu.edu.cn.
  • Huifang Ma