Concept-as-Tree: Synthetic Data is All You Need for VLM Personalization
Journal: arXiv
Published Date: Mar 17, 2025
Abstract
Vision-Language Models (VLMs) have demonstrated exceptional performance in
various multi-modal tasks. Recently, there has been an increasing interest in
improving the personalization capabilities of VLMs. To better integrate
user-provided concepts into VLMs, many methods use positive and negative
samples to fine-tune these models. However, the scarcity of user-provided
positive samples and the low quality of retrieved negative samples pose
challenges for fine-tuning. To reveal the relationship between samples and model
performance, we systematically investigate the impact of positive and negative
samples (easy and hard) and their diversity on VLM personalization tasks. Based
on the detailed analysis, we introduce Concept-as-Tree (CaT), which represents
a concept as a tree structure, thereby enabling the data generation of positive
and negative samples with varying difficulty and diversity for VLM
personalization. With a well-designed data filtering strategy, our CaT
framework can ensure the quality of generated data, constituting a powerful
pipeline. We perform thorough experiments with various VLM personalization
baselines to assess how effectively the pipeline alleviates the lack of
positive samples and the low quality of negative samples. Our results
demonstrate that CaT equipped with the proposed data filter significantly
enhances the personalization capabilities of VLMs across the MyVLM, Yo'LLaVA,
and MC-LLaVA datasets. To our knowledge, this work presents the first controllable
synthetic data pipeline for VLM personalization. The code is released at
https://github.com/zengkaiya/CaT.
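
The abstract does not spell out the tree layout, so the following is only a minimal sketch of how a concept represented as a tree could drive generation of positive, hard-negative, and easy-negative prompts. It assumes a root category with attribute dimensions as children and attribute values as leaves. Every name here (`ConceptTree`, `positive_prompt`, `hard_negative_prompt`, `easy_negative_prompt`, the `fixed` identity attributes) is hypothetical and is not taken from the released CaT code; the image generator and the data filter are omitted.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class ConceptTree:
    """Hypothetical concept tree: root category -> dimensions -> leaf values."""
    root: str
    dimensions: dict[str, list[str]] = field(default_factory=dict)

    def sample_attributes(self, rng: random.Random) -> dict[str, str]:
        # Pick one leaf value for every attribute dimension.
        return {dim: rng.choice(vals) for dim, vals in self.dimensions.items()}


def to_prompt(root: str, attrs: dict[str, str]) -> str:
    # Turn a root and its chosen leaves into a text prompt for an image generator.
    details = ", ".join(f"{dim}: {val}" for dim, val in sorted(attrs.items()))
    return f"a photo of a {root} ({details})" if details else f"a photo of a {root}"


def positive_prompt(tree: ConceptTree, fixed: dict[str, str], rng: random.Random) -> str:
    # Assumption: positives keep identity attributes fixed and vary the rest.
    attrs = tree.sample_attributes(rng)
    attrs.update(fixed)
    return to_prompt(tree.root, attrs)


def hard_negative_prompt(tree: ConceptTree, fixed: dict[str, str], rng: random.Random) -> str:
    # Assumption: hard negatives share the root but differ in one identity attribute.
    attrs = tree.sample_attributes(rng)
    attrs.update(fixed)
    dim = rng.choice(sorted(fixed))
    alternatives = [v for v in tree.dimensions[dim] if v != fixed[dim]]
    attrs[dim] = rng.choice(alternatives)
    return to_prompt(tree.root, attrs)


def easy_negative_prompt(tree: ConceptTree, other_roots: list[str], rng: random.Random) -> str:
    # Assumption: easy negatives replace the root category altogether.
    candidates = [r for r in other_roots if r != tree.root]
    return to_prompt(rng.choice(candidates), {})


if __name__ == "__main__":
    rng = random.Random(0)
    tree = ConceptTree(
        root="dog",
        dimensions={
            "color": ["brown", "white", "black"],
            "pose": ["sitting", "running"],
            "background": ["park", "beach"],
        },
    )
    fixed = {"color": "brown"}  # illustrative identity attribute of the user's concept
    print(positive_prompt(tree, fixed, rng))
    print(hard_negative_prompt(tree, fixed, rng))
    print(easy_negative_prompt(tree, ["dog", "cat", "mug"], rng))
```

In this sketch, difficulty is controlled by how deep in the tree the change is made (a leaf for hard negatives, the root for easy ones), and diversity comes from resampling the dimensions that are not fixed.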