Reconsidering learnable fine-grained text prompts for few-shot anomaly detection in visual-language models.

Journal: Neural Networks: The Official Journal of the International Neural Network Society
Published Date:

Abstract

Few-Shot Anomaly Detection (FSAD) in industrial images aims to identify abnormalities using only a few normal images, which is crucial for industrial scenarios where training samples are limited. Recent advances in large-scale pre-trained visual-language models have brought significant improvements to FSAD, but these approaches typically require hundreds of manually crafted text prompts obtained through prompt engineering. However, manually designed text prompts cannot accurately match the informative features of different categories across diverse images, and the domain gap between training and test datasets can severely impact the generalization capability of text prompts. To address these issues, we propose a visual-language model based on fine-grained learnable text prompts as a unified general framework for industrial FSAD. First, we design a Fine-grained Text Prompts Adapter (FTPA) and an associated registration loss to enhance the efficiency of text prompts. The manually designed text prompts are refined by capturing normal and abnormal semantic information from the image, so that the text prompts can describe image semantics at a finer granularity. In addition, we introduce a Dynamic Modulation Mechanism (DMM) to avoid potential errors of the trained text prompts on unseen categories during cross-dataset detection. This is achieved by explicitly modulating the branch guided by few-shot images against the branch guided by fine-grained text prompts. Extensive experiments demonstrate that our proposed method achieves state-of-the-art few-shot industrial anomaly detection and segmentation performance. In the 4-shot setting, the AUROC of anomaly classification and anomaly segmentation reaches 98.3% and 96.3% on MVTec-AD, and 93.8% and 97.9% on VisA, respectively.
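
To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch, not the authors' code: a CoOp-style module in which learnable context tokens are prepended to hand-crafted normal/abnormal prompt embeddings (the spirit of the FTPA), and a per-patch gate that dynamically balances the text-prompt-guided anomaly scores against the few-shot image-guided scores (the spirit of the DMM). All module names, dimensions, and the sigmoid gating form are illustrative assumptions; the frozen CLIP text encoder and the registration loss are omitted.

```python
import torch
import torch.nn as nn


class FineGrainedTextPrompts(nn.Module):
    """Learnable context tokens prepended to hand-crafted prompt stems.

    Sketch of the FTPA idea: the contexts are optimized so the prompts
    capture finer-grained normal/abnormal image semantics.
    """

    def __init__(self, embed_dim: int = 512, n_ctx: int = 8):
        super().__init__()
        # Separate learnable contexts for the normal and abnormal states.
        self.ctx_normal = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        self.ctx_abnormal = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

    def forward(self, stem_normal: torch.Tensor, stem_abnormal: torch.Tensor):
        # stem_*: (n_tokens, embed_dim) token embeddings of a hand-crafted
        # prompt such as "a photo of a flawless {class}". The resulting
        # sequences would be fed to a frozen CLIP text encoder (omitted).
        normal = torch.cat([self.ctx_normal, stem_normal], dim=0)
        abnormal = torch.cat([self.ctx_abnormal, stem_abnormal], dim=0)
        return normal, abnormal


class DynamicModulation(nn.Module):
    """Gate that balances the text-guided and few-shot image-guided branches."""

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, patch_feats, score_text, score_image):
        # patch_feats: (n_patches, feat_dim) visual features of the test image.
        # score_text / score_image: (n_patches,) per-patch anomaly scores from
        # the fine-grained text-prompt branch and the few-shot memory branch.
        w = self.gate(patch_feats).squeeze(-1)           # (n_patches,)
        return w * score_text + (1.0 - w) * score_image  # fused anomaly map


# Toy usage with random tensors standing in for encoder outputs.
prompts = FineGrainedTextPrompts()
fuse = DynamicModulation()
normal_seq, abnormal_seq = prompts(torch.randn(6, 512), torch.randn(6, 512))
fused_scores = fuse(torch.randn(196, 512), torch.rand(196), torch.rand(196))
```

Under this reading, the gate lets the model lean on the few-shot image branch when the learned prompts transfer poorly to an unseen category, which is the failure mode the DMM is said to guard against in cross-dataset detection.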

Authors

  • Delong Han
    Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, 250353, China; Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, Jinan, 250014, China. Electronic address: handl@qlu.edu.cn.
  • Luo Xu
    School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China.
  • Mingle Zhou
    Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, 250353, China; Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, Jinan, 250014, China. Electronic address: zhouml@qlu.edu.cn.
  • Jin Wan
    Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, 250353, China; Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, Jinan, 250014, China. Electronic address: 18112033@bjtu.edu.
  • Min Li
    Hubei Provincial Institute for Food Supervision and Test, Hubei Provincial Engineering and Technology Research Center for Food Quality and Safety Test, Wuhan 430075, China.
  • Gang Li
    The Centre for Cyber Resilience and Trust, Deakin University, Australia.