CONCORD: Concept-Informed Diffusion for Dataset Distillation
Journal: arXiv
Published Date: May 23, 2025
Abstract
Dataset distillation (DD) has witnessed significant progress in creating
small datasets that encapsulate rich information from large original ones.
In particular, methods based on generative priors show promising performance
while maintaining computational efficiency and cross-architecture
generalization. However, the generation process lacks explicit controllability
for each sample. Previous distillation methods primarily match the real
distribution at the level of the entire dataset, while overlooking
concept completeness at the instance level. Missing or incorrectly
represented object details cannot be efficiently compensated for, given the
limited number of samples typical of DD settings. To address this, we propose
incorporating the concept understanding of large language models (LLMs) to
perform Concept-Informed Diffusion (CONCORD) for dataset distillation.
Specifically, distinguishable and fine-grained concepts are retrieved based on
category labels to inform the denoising process and refine essential object
details. By integrating these concepts, the proposed method significantly
enhances both the controllability and interpretability of the distilled image
generation, without relying on pre-trained classifiers. We demonstrate the
efficacy of CONCORD by achieving state-of-the-art performance on ImageNet-1K
and its subsets. The code is available at
https://github.com/vimar-gu/CONCORD.
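
To make the idea of concept-informed denoising concrete, the following is a minimal toy sketch and not the released implementation. It assumes that concepts retrieved for a category label can be represented as embedding vectors, and that they inform sampling by nudging each intermediate sample toward higher mean cosine similarity with those vectors. The names `CONCEPT_BANK`, `retrieve_concepts`, `denoise_step`, and `guided_sampling`, the hard-coded embeddings, and the guidance rule itself are all hypothetical; in particular, the dictionary lookup stands in for querying an LLM, and the identity `denoise_step` stands in for a real diffusion model update.

```python
import math

# Hypothetical concept bank: toy 3-D embeddings standing in for
# fine-grained concepts that an LLM would return for a category label.
CONCEPT_BANK = {
    "goldfish": [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]],
}


def retrieve_concepts(label):
    """Look up concept embeddings for a label (stand-in for LLM retrieval)."""
    return CONCEPT_BANK[label]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(x, c):
    return dot(x, c) / (norm(x) * norm(c))


def cosine_grad(x, c):
    """Analytic gradient of cosine(x, c) with respect to x (x must be non-zero)."""
    nx, nc = norm(x), norm(c)
    cos = dot(x, c) / (nx * nc)
    return [ci / (nx * nc) - cos * xi / (nx * nx) for xi, ci in zip(x, c)]


def concept_score(x, concepts):
    """Mean cosine similarity between a sample and its concept embeddings."""
    return sum(cosine(x, c) for c in concepts) / len(concepts)


def concept_gradient(x, concepts):
    """Gradient of concept_score with respect to x."""
    grads = [cosine_grad(x, c) for c in concepts]
    return [sum(g[i] for g in grads) / len(grads) for i in range(len(x))]


def denoise_step(x):
    """Placeholder for a diffusion model's denoising update (identity here)."""
    return list(x)


def guided_sampling(x, concepts, steps=50, guidance=0.1):
    """Alternate a (placeholder) denoising step with a concept-guidance step."""
    for _ in range(steps):
        x = denoise_step(x)
        g = concept_gradient(x, concepts)
        x = [xi + guidance * gi for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    concepts = retrieve_concepts("goldfish")
    start = [0.0, 0.0, 1.0]  # toy sample, initially unrelated to the concepts
    end = guided_sampling(start, concepts)
    print("score before:", concept_score(start, concepts))
    print("score after: ", concept_score(end, concepts))
```

In this toy setting the guidance term needs no pre-trained classifier: the only signal is similarity to the retrieved concept vectors. How the concepts are actually embedded and injected into the denoising process is specified in the paper and the repository, not by this sketch.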