Keep the General, Inject the Specific: Structured Dialogue Fine-Tuning for Knowledge Injection without Catastrophic Forgetting
Journal: arXiv
Published Date: Apr 27, 2025
Abstract
Large Vision Language Models have demonstrated impressive, versatile
capabilities through extensive multimodal pre-training, but face significant
limitations when incorporating specialized knowledge from domains beyond their
training distribution. These models struggle with a fundamental dilemma: direct
adaptation approaches that inject domain-specific knowledge often trigger
catastrophic forgetting of foundational visual-linguistic abilities. We
introduce Structured Dialogue Fine-Tuning (SDFT), an approach that
injects domain-specific knowledge while minimizing catastrophic
forgetting. Drawing inspiration from supervised fine-tuning in LLMs and
subject-driven personalization in text-to-image diffusion models, our method
employs a three-phase dialogue structure: Foundation Preservation reinforces
pre-trained visual-linguistic alignment through caption tasks; Contrastive
Disambiguation introduces carefully designed counterfactual examples to
maintain semantic boundaries; and Knowledge Specialization embeds specialized
information through chain-of-thought reasoning. Experimental results across
multiple domains confirm SDFT's effectiveness in balancing specialized
knowledge acquisition with general capability retention. Our key contributions
include a data-centric dialogue template that balances foundational alignment
with targeted knowledge integration, a weighted multi-turn supervision
framework, and comprehensive evaluation across diverse knowledge types.
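
To make the three-phase dialogue structure and the weighted multi-turn supervision more concrete, the following is a minimal Python sketch of how one training sample might be assembled and how per-turn losses might be combined. It is not the authors' implementation: the abstract does not specify prompt wording, weight values, or how the loss is normalized, so the names `Turn`, `build_dialogue`, and `weighted_dialogue_loss`, the example prompts, the per-turn weights, and the weight-normalized average are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Turn:
    phase: str      # which of the three SDFT phases this turn belongs to
    question: str   # user prompt (wording here is hypothetical)
    answer: str     # supervised target for this turn
    weight: float   # hypothetical per-turn loss weight


def build_dialogue(
    caption: str,
    concept: str,
    distractor: str,
    rationale: str,
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> List[Turn]:
    """Assemble one three-turn sample: caption, counterfactual, specialized answer."""
    if len(weights) != 3:
        raise ValueError("expected exactly three per-turn weights")
    return [
        # Phase 1: a plain caption task that reinforces pre-trained alignment.
        Turn("foundation_preservation", "Describe the image.", caption, weights[0]),
        # Phase 2: a counterfactual question that keeps semantic boundaries sharp.
        Turn(
            "contrastive_disambiguation",
            f"Is this {distractor}?",
            f"No, this is not {distractor}.",
            weights[1],
        ),
        # Phase 3: the specialized answer, reached through a reasoning chain.
        Turn(
            "knowledge_specialization",
            "What does the image show? Explain step by step.",
            f"{rationale} Therefore, this is {concept}.",
            weights[2],
        ),
    ]


def weighted_dialogue_loss(turn_losses: Sequence[float], turns: Sequence[Turn]) -> float:
    """Combine per-turn losses into one scalar (assumed: weight-normalized average)."""
    if len(turn_losses) != len(turns):
        raise ValueError("need one loss value per turn")
    total_weight = sum(t.weight for t in turns)
    if total_weight <= 0:
        raise ValueError("turn weights must sum to a positive value")
    return sum(l * t.weight for l, t in zip(turn_losses, turns)) / total_weight


if __name__ == "__main__":
    sample = build_dialogue(
        caption="A small dog sitting on a sofa.",
        concept="the user's dog Biscuit",       # made-up concept for illustration
        distractor="a generic golden retriever",
        rationale="The dog has a white patch on its left ear.",
        weights=(1.0, 1.0, 2.0),
    )
    for turn in sample:
        print(f"[{turn.phase}] Q: {turn.question} A: {turn.answer}")
    print(weighted_dialogue_loss([1.0, 2.0, 3.0], sample))
```

In an actual fine-tuning run, the per-turn losses would come from the model's token-level cross-entropy over each answer span; plain floats stand in for them here so the sketch stays self-contained.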