DGPrompt: Dual-guidance prompts generation for vision-language models.

Journal: Neural Networks: the official journal of the International Neural Network Society
Published Date:

Abstract

Introducing learnable prompts into CLIP and fine-tuning them has demonstrated excellent performance across many downstream tasks. However, existing methods allow insufficient interaction between modalities and neglect the importance of hierarchical contextual information, leading to ineffective alignment of the visual and textual representation spaces. Additionally, CLIP is highly sensitive to prompts, making learnable prompts prone to overfitting on seen classes, which results in the forgetting of CLIP's general knowledge and severely impairs generalization to unseen classes. To address these issues, we propose a novel Dual-Guidance Prompts Generation (DGPrompt) method that promotes alignment between the visual and textual spaces while ensuring the continuous retention of general knowledge. The main ideas of DGPrompt are as follows: 1) The extraction of image and text embeddings is guided mutually by generating visual and textual prompts, making full use of complementary information from both modalities to align the visual and textual spaces. 2) The prompt-tuning process is constrained by a retention module, reducing the forgetting of general knowledge. Extensive experiments conducted in base-to-new class generalization and few-shot learning settings demonstrate the superiority of the proposed method. Compared with the baseline method CLIP and the state-of-the-art method MaPLe, DGPrompt exhibits favorable performance and achieves absolute gains of 7.84% and 0.99%, respectively, in overall harmonic mean averaged over 11 diverse image recognition datasets.
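
To make the two ideas in the abstract concrete, the sketch below illustrates, under our own assumptions rather than the paper's released code, (1) generating prompts for one modality from the other modality's embedding (dual guidance) and (2) a retention term that keeps prompt-tuned features close to frozen-CLIP features. All names here (PromptGenerator, retention_loss, embed_dim, prompt_len) are hypothetical placeholders, not the authors' API.

```python
# Minimal, illustrative sketch of dual-guidance prompt generation with a
# retention penalty.  Tensor shapes and module names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptGenerator(nn.Module):
    """Maps an embedding from one modality to a set of prompt tokens for the
    other modality (one reading of the abstract's 'mutual guidance')."""

    def __init__(self, embed_dim: int, prompt_len: int):
        super().__init__()
        self.prompt_len = prompt_len
        self.proj = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, prompt_len * embed_dim),
        )

    def forward(self, guidance: torch.Tensor) -> torch.Tensor:
        # guidance: (batch, embed_dim) -> prompts: (batch, prompt_len, embed_dim)
        return self.proj(guidance).view(-1, self.prompt_len, guidance.shape[-1])


def retention_loss(tuned: torch.Tensor, frozen: torch.Tensor) -> torch.Tensor:
    """Penalize drift of prompt-tuned features away from frozen-CLIP features;
    a simple stand-in for the retention module described in the abstract."""
    return 1.0 - F.cosine_similarity(tuned, frozen, dim=-1).mean()


if __name__ == "__main__":
    # Toy usage: random tensors stand in for CLIP image/text embeddings.
    batch, dim, plen = 4, 512, 2
    img_emb = torch.randn(batch, dim)      # image embedding (guides textual prompts)
    txt_emb = torch.randn(batch, dim)      # text embedding (guides visual prompts)
    frozen_img_emb = img_emb.detach()      # frozen-CLIP reference features

    gen_visual = PromptGenerator(dim, plen)   # text -> visual prompts
    gen_textual = PromptGenerator(dim, plen)  # image -> textual prompts

    visual_prompts = gen_visual(txt_emb)      # would be prepended to image tokens
    textual_prompts = gen_textual(img_emb)    # would be prepended to text tokens

    loss = retention_loss(img_emb + visual_prompts.mean(dim=1), frozen_img_emb)
    print(visual_prompts.shape, textual_prompts.shape, loss.item())
```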

Authors

  • Tai Zheng
    School of Software, Shandong University, 1500 Shunhua Road, Jinan 250101, China. Electronic address: zt5369623@gmail.com.
  • Zhen-Duo Chen
    School of Software, Shandong University, 1500 Shunhua Road, Jinan 250101, China. Electronic address: chenzd.sdu@gmail.com.
  • Zi-Chao Zhang
    School of Software, Shandong University, 1500 Shunhua Road, Jinan 250101, China. Electronic address: zhangzichao1008@163.com.
  • Zhen-Xiang Ma
    School of Software, Shandong University, 1500 Shunhua Road, Jinan 250101, China. Electronic address: mazhenxiang0923@163.com.
  • Li-Jun Zhao
    School of Software, Shandong University, 1500 Shunhua Road, Jinan 250101, China. Electronic address: lj_zhao1028@163.com.
  • Chong-Yu Zhang
    School of Software, Shandong University, 1500 Shunhua Road, Jinan 250101, China. Electronic address: zhangchongyu22@gmail.com.
  • Xin Luo
    Department of Pharmacology, The Basic Medical Sciences College of Xinjiang Medical University, Urumqi 830054, China.
  • Xin-Shun Xu
    School of Software, Shandong University, 1500 Shunhua Road, Jinan 250101, China. Electronic address: xuxinshun@sdu.edu.cn.