Dual Position Relationship Transformer for Image Captioning.

Journal: Big Data
Published Date:

Abstract

Feature vectors extracted by an object detector have been shown to improve image captioning performance. However, existing frameworks extract insufficient information, in particular the positional relationships that are essential for judging how objects relate to one another. To fill this gap, we present a dual position relationship transformer (DPR) for image captioning. The architecture improves both the image information extraction and the description encoding steps: it first calculates the relative position (RP) and absolute position (AP) of objects and then integrates this dual position relationship information into self-attention. Specifically, a convolutional neural network (CNN) and Faster R-CNN are applied to extract image features and detect objects, and the RP and AP of the generated object boxes are then calculated. The former is expressed in coordinate form, and the latter is calculated by sinusoidal encoding. In addition, to better model the sequential and temporal relationships in the description, DPR adopts long short-term memory (LSTM) to encode the text vectors. Extensive experiments on the Microsoft COCO: Common Objects in Context (MSCOCO) image captioning data set show that our method achieves superior performance: the Consensus-based Image Description Evaluation (CIDEr) score increases to 114.6 after 30 training epochs, and the model runs 2 times faster than other competitive methods. The ablation study verifies the effectiveness of the proposed module.
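
As a rough illustration of the two position signals named in the abstract, the sketch below computes a coordinate-form relative position for every pair of object boxes and a sinusoidal encoding of each box's absolute center. The abstract does not give the formulas, so everything here is an assumption: the function names (`relative_position`, `sinusoidal_encoding`, `absolute_position`), the `(x1, y1, x2, y2)` box layout, the log-ratio size terms, and the encoding dimension are hypothetical choices, not the authors' implementation. How these signals enter self-attention is likewise not specified in the abstract and is left out.

```python
import numpy as np

def relative_position(boxes):
    """Pairwise relative geometry for boxes given as (x1, y1, x2, y2).

    Assumed form: center offsets normalized by the reference box size,
    plus log width/height ratios. Returns an array of shape (n, n, 4).
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    # Entry [i, j] describes box j relative to reference box i.
    dx = (cx[None, :] - cx[:, None]) / w[:, None]
    dy = (cy[None, :] - cy[:, None]) / h[:, None]
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def sinusoidal_encoding(values, dim=8, base=10000.0):
    """Encode a 1-D array of scalars as sin/cos features of shape (n, dim)."""
    values = np.asarray(values, dtype=np.float64)
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)
    angles = values[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def absolute_position(boxes, image_size, dim=8):
    """Sinusoidal encoding of normalized box centers, shape (n, 2 * dim)."""
    boxes = np.asarray(boxes, dtype=np.float64)
    width, height = image_size
    cx = (boxes[:, 0] + boxes[:, 2]) / (2.0 * width)
    cy = (boxes[:, 1] + boxes[:, 3]) / (2.0 * height)
    return np.concatenate(
        [sinusoidal_encoding(cx, dim), sinusoidal_encoding(cy, dim)], axis=-1
    )

# Two example boxes in a hypothetical 100 x 100 image.
example_boxes = [[0, 0, 10, 10], [10, 0, 30, 20]]
rp = relative_position(example_boxes)
ap = absolute_position(example_boxes, image_size=(100, 100))
```

In a transformer encoder, features like `rp` would typically be projected to a per-pair scalar and combined with the attention logits, while `ap` would be added to or concatenated with the region features; which of these the paper uses is not stated in the abstract.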

Authors

  • Yaohan Wang
    Department of Information Science and Engineering, Yunnan University, Kunming, China.
  • Wenhua Qian
    School of Information Science and Engineering, Yunnan University, Kunming, Yunnan 650091, China qwhua003@sina.com.
  • Rencan Nie
    School of Information Science and Engineering, Yunnan University, Kunming, Yunnan 650091, China, and School of Automation, Southeast University, Jiangsu, Nanjing 210096, China rcnie@ynu.edu.cn.
  • Dan Xu
    Department of Orthodontics, The Affiliated Stomatological Hospital of Southwest Medical University, Luzhou, China.
  • Jinde Cao
  • Pyoungwon Kim
    College of Education, Incheon National University, Incheon, Korea.