A novel multi-modal retrieval framework for tracking vehicles using natural language descriptions.

Journal: PLOS ONE
Published Date:

Abstract

Recent advances in multimodal and contrastive learning have significantly improved image and video retrieval. Combining the two opens up opportunities for multi-dimensional and multi-view retrieval, especially in multi-camera traffic surveillance. This paper introduces a Multi-modal Vehicle Retrieval (MVR) system that retrieves the trajectories of tracked vehicles from natural language descriptions. The MVR system integrates an end-to-end text-video contrastive learning model, uses CLIP for feature extraction, and adds a matching control system and multi-context-based attributes. Post-processing techniques remove erroneous matches. By building a comprehensive representation of vehicle characteristics, the MVR system can identify the trajectory that matches a given description. The method achieves a mean reciprocal rank (MRR) of 0.8966 on the test set of the 7th AI City Challenge Track 2 (retrieval of tracked vehicles through natural language descriptions), surpassing the previous top-ranked result on the public leaderboard.
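For readers unfamiliar with the metric, the mean reciprocal rank averages, over all queries, the reciprocal of the position at which the correct item first appears in the ranked results: MRR = (1/|Q|) * sum over queries of 1/rank. The snippet below is a minimal sketch of that standard definition, not the challenge's official evaluation code. The function name, the dictionary layout, and the toy query and track identifiers are all hypothetical, and the sketch assumes that a query whose correct track is missing from the ranking contributes 0.

```python
from typing import Dict, List


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]],
    ground_truth: Dict[str, str],
) -> float:
    """Average of 1/rank of the correct track over all queries.

    rankings maps a query id to its ranked list of track ids (best first).
    ground_truth maps a query id to the single correct track id.
    Assumption: a query whose correct track is absent contributes 0.
    """
    if not rankings:
        raise ValueError("rankings must contain at least one query")

    total = 0.0
    for query_id, ranked_tracks in rankings.items():
        target = ground_truth[query_id]
        if target in ranked_tracks:
            # list.index is 0-based, ranks are 1-based
            total += 1.0 / (ranked_tracks.index(target) + 1)
    return total / len(rankings)


# Hypothetical toy data: three text queries ranked against three tracks.
rankings = {
    "q1": ["track_a", "track_b", "track_c"],  # correct track at rank 1
    "q2": ["track_b", "track_c", "track_a"],  # correct track at rank 2
    "q3": ["track_c", "track_a", "track_b"],  # correct track at rank 3
}
ground_truth = {"q1": "track_a", "q2": "track_c", "q3": "track_b"}

score = mean_reciprocal_rank(rankings, ground_truth)
print(round(score, 4))  # (1 + 1/2 + 1/3) / 3 -> 0.6111
```

Under this definition, a score of 0.8966 means the correct trajectory is ranked first for most queries and close to the top for the remainder.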

Authors

  • Changhao Zhang
    College of Computer Science and Technology, Xinjiang Normal University, Urumqi, China.
  • Zhandong Liu
    Department of Pediatrics, Baylor College of Medicine, Houston, TX, USA. zhandong.liu@bcm.edu.
  • Ke Li
    School of Ideological and Political Education, Shanghai Maritime University, Shanghai, China.
  • Yong Li
    Department of Surgical Sciences, Western Michigan University Homer Stryker M.D. School of Medicine, Kalamazoo, MI, United States.
  • Xiangwei Qi
    College of Computer Science and Technology, Xinjiang Normal University, Urumqi, China.
  • Nan Ding
    Reproductive Medicine Center, Lanzhou University Second Hospital, No.82, Cuiying Road, Chengguan District, Lanzhou City, Gansu Province, China.