Can Multimodal Large Language Models be Guided to Improve Industrial Anomaly Detection?
Journal: arXiv
Published Date: Jan 27, 2025
Abstract
In industrial settings, the accurate detection of anomalies is essential for
maintaining product quality and ensuring operational safety. Traditional
industrial anomaly detection (IAD) models often struggle with flexibility and
adaptability, especially in dynamic production environments where new defect
types and operational changes frequently arise. Recent advancements in
Multimodal Large Language Models (MLLMs) hold promise for overcoming these
limitations by combining visual and textual information processing
capabilities. MLLMs excel in general visual understanding due to their training
on large, diverse datasets, but they lack domain-specific knowledge, such as
industry-specific defect tolerance levels, which limits their effectiveness in
IAD tasks. To address these challenges, we propose Echo, a novel multi-expert
framework designed to enhance MLLM performance for IAD. Echo integrates four
expert modules: Reference Extractor, which provides a contextual baseline by
retrieving similar normal images; Knowledge Guide, which supplies
domain-specific insights; Reasoning Expert, which enables structured, stepwise
reasoning for complex queries; and Decision Maker, which synthesizes information
from all modules to deliver precise, context-aware responses. Evaluated on the
MMAD benchmark, Echo demonstrates significant improvements in adaptability,
precision, and robustness, moving closer to meeting the demands of real-world
industrial anomaly detection.
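
The four-module flow described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not say how each module works internally, so the cosine-similarity retrieval, the dictionary-based knowledge lookup, the fixed reasoning steps, and every name below (`NormalImage`, `reference_extractor`, `knowledge_guide`, `reasoning_expert`, `decision_maker`, `echo_pipeline`) are hypothetical stand-ins. The final step only assembles a text prompt; a real system would presumably pass it, together with the images, to an MLLM.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, List, Sequence


@dataclass
class NormalImage:
    """Hypothetical record: a defect-free reference image and its embedding."""
    name: str
    category: str
    embedding: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reference_extractor(query_emb: Sequence[float],
                        bank: List[NormalImage],
                        k: int = 1) -> List[NormalImage]:
    """Return the k normal images most similar to the query.

    Assumption: similarity is cosine similarity over precomputed embeddings.
    """
    ranked = sorted(bank, key=lambda img: cosine(query_emb, img.embedding), reverse=True)
    return ranked[:k]


def knowledge_guide(category: str, knowledge: Dict[str, str]) -> str:
    """Look up domain notes for a product category (assumed: a plain dictionary)."""
    return knowledge.get(category, "No domain notes available.")


def reasoning_expert(question: str) -> List[str]:
    """Produce a stepwise plan; here a fixed template stands in for a model."""
    return [
        "Compare the query image with the retrieved normal reference.",
        "Check any difference against the domain notes.",
        f"Answer the question: {question}",
    ]


def decision_maker(question: str, references: List[NormalImage],
                   notes: str, steps: List[str]) -> str:
    """Combine all expert outputs into one prompt for a downstream MLLM."""
    ref_names = ", ".join(ref.name for ref in references)
    plan = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (f"References: {ref_names}\n"
            f"Domain notes: {notes}\n"
            f"Plan:\n{plan}\n"
            f"Question: {question}")


def echo_pipeline(question: str, category: str, query_emb: Sequence[float],
                  bank: List[NormalImage], knowledge: Dict[str, str], k: int = 1) -> str:
    """Run the four hypothetical experts in sequence and return the final prompt."""
    refs = reference_extractor(query_emb, bank, k)
    notes = knowledge_guide(category, knowledge)
    steps = reasoning_expert(question)
    return decision_maker(question, refs, notes, steps)
```

Calling `echo_pipeline` with a query embedding, a small bank of `NormalImage` records, and a category-to-notes dictionary returns a single prompt string. Keeping each expert as a separate function mirrors the modular design the abstract describes, so any one stand-in could be swapped for a learned component without touching the others.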