Dual Position Relationship Transformer for Image Captioning.

Journal: Big Data

Published Date: Jan 4, 2022

Abstract

Feature vectors extracted by an object detector have been shown to improve image captioning performance. However, existing frameworks extract insufficient information, in particular the positional relationships that are important for judging how objects relate to one another. To fill this gap, we present a dual position relationship (DPR) transformer for image captioning. The architecture improves both the image information extraction and the description encoding steps: it first calculates the relative position (RP) and absolute position (AP) of objects, and then integrates this dual position relationship information into self-attention. Specifically, a convolutional neural network (CNN) and Faster R-CNN are applied for image feature extraction and object detection, and the RP and AP of the generated object boxes are then calculated. The former is expressed in coordinate form, and the latter is calculated by sinusoidal encoding. In addition, to better model the sequential and temporal relationships in the description, DPR adopts long short-term memory (LSTM) to encode the text vectors. Extensive experiments on the Microsoft COCO: Common Objects in Context (MSCOCO) image captioning data set show that our method achieves superior performance: the Consensus-based Image Description Evaluation (CIDEr) score increases to 114.6 after 30 training epochs, and the model runs 2 times faster than other competitive methods. The ablation study verifies the effectiveness of the proposed module.
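The position computations described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the box format (x1, y1, x2, y2), the choice of box centres for AP, the normalized offsets and log size ratios for RP, and the names sinusoidal_encoding, absolute_position, relative_position, attention_weights and rp_weights are all assumptions made for illustration. The additive attention bias is one common way to inject geometry into self-attention and may differ from the fusion used in the paper.

```python
import math

def sinusoidal_encoding(value, dim=8, base=10000.0):
    """Encode one scalar with the standard Transformer sin/cos scheme."""
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        enc.extend([math.sin(value * freq), math.cos(value * freq)])
    return enc

def box_geometry(box):
    """Return (centre_x, centre_y, width, height) of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1

def absolute_position(box, dim=8):
    """Assumed AP: sinusoidal encoding of the box centre coordinates."""
    cx, cy, _, _ = box_geometry(box)
    return sinusoidal_encoding(cx, dim) + sinusoidal_encoding(cy, dim)

def relative_position(box_a, box_b):
    """Assumed RP in coordinate form: normalized offsets and log size ratios."""
    ax, ay, aw, ah = box_geometry(box_a)
    bx, by, bw, bh = box_geometry(box_b)
    return ((bx - ax) / aw, (by - ay) / ah,
            math.log(bw / aw), math.log(bh / ah))

def attention_weights(queries, keys, boxes, rp_weights):
    """Scaled dot-product attention with a hypothetical linear RP bias."""
    d = len(queries[0])
    rows = []
    for q, box_q in zip(queries, boxes):
        logits = []
        for k, box_k in zip(keys, boxes):
            content = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            rp = relative_position(box_q, box_k)
            bias = sum(w * r for w, r in zip(rp_weights, rp))
            logits.append(content + bias)
        # Numerically stable softmax over the biased logits.
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

# Two equally sized boxes side by side; the second lies one box width to the right.
boxes = [(0.0, 0.0, 2.0, 2.0), (2.0, 0.0, 4.0, 2.0)]
rp = relative_position(boxes[0], boxes[1])
ap = absolute_position(boxes[0])
feats = [[1.0, 0.0], [0.0, 1.0]]
weights = attention_weights(feats, feats, boxes, (0.5, 0.5, 0.0, 0.0))
```

In a real model the AP vectors would typically be added to or concatenated with the detector's region features, and the RP bias would be learned rather than fixed; both choices are left open here.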

Authors

Yaohan Wang

Department of Information Science and Engineering, Yunnan University, Kunming, China.

Wenhua Qian

School of Information Science and Engineering, Yunnan University, Kunming, Yunnan 650091, China. Email: qwhua003@sina.com

Rencan Nie

School of Information Science and Engineering, Yunnan University, Kunming, Yunnan 650091, China, and School of Automation, Southeast University, Jiangsu, Nanjing 210096, China. Email: rcnie@ynu.edu.cn

Dan Xu

Department of Orthodontics, The Affiliated Stomatological Hospital of Southwest Medical University, Luzhou, China.

Jinde Cao

Pyoungwon Kim

College of Education, Incheon National University, Incheon, Korea.

Keywords

Information Storage and Retrieval; Neural Networks, Computer

External Resources

PubMed ID: 34981961
