CLOP-DiT: Structured-Metadata-Conditioned Single-Cell Latent Generation via Contrastive Language-Omics Pretraining and Diffusion Transformers

Journal: bioRxiv
Published Date:

Abstract

Generating realistic single-cell transcriptomic profiles from structured biological descriptions would enable controlled simulation, data augmentation, and hypothesis-driven cell-state creation, yet no existing method combines text-cell alignment with conditional generation. We present CLOP-DiT, a modular three-stage pipeline: (1) a contrastive aligner (CLOP) maps BiomedBERT text embeddings and scGPT cell embeddings into a shared 512-dimensional space; (2) a conditional Diffusion Transformer (DiT) generates scGPT-compatible latent states via flow matching, steered by a five-field biological template (cell type, tissue, organism, marker genes, disease); and (3) a frozen scGPT decoder maps latents to gene expression. Across 69 cell types from 80 GEO datasets (220,304 cells), a high-fidelity regime (CFG = 2.0) achieves 36.9% KNN accuracy (25x chance) and 81.0% steering accuracy, while a high-diversity regime (CFG = 1.0) reaches a diversity ratio of 0.93 at 80.7% steering accuracy. Conditioning-field ablation and swap-label permutation tests confirm that marker genes are the dominant steering signal (steering accuracy drops from 99.8% to 62.4% when only metadata fields are retained). Key limitations are reported explicitly: in-distribution per-gene variance structure is well preserved (r = 0.98) but cross-dataset variance correlation drops to near zero, the discriminator AUC of 0.656 indicates residual distinguishability, and a pilot rare-cell augmentation study was negative. The modular architecture enables targeted remediation of each limitation without full retraining. CLOP-DiT establishes the feasibility of structured-metadata-conditioned single-cell generation and provides a composable framework for iterative improvement.
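
The two CFG regimes in the abstract can be illustrated with a minimal sketch of classifier-free guidance applied to a flow-matching sampler. This is not the authors' implementation: the standard guidance formula, the Euler integrator, the step count, and the hypothetical `toy_velocity` stand-in for the trained DiT are all assumptions made for illustration, and the 8-dimensional latent is arbitrary.

```python
import numpy as np

def guided_velocity(v_cond, v_uncond, cfg_scale):
    """Standard classifier-free guidance (assumed form): move from the
    unconditional velocity toward the conditional one by cfg_scale."""
    return v_uncond + cfg_scale * (v_cond - v_uncond)

def toy_velocity(x, t, cond):
    """Hypothetical stand-in for the trained DiT: a straight-line flow
    toward `cond`, or toward the origin when unconditioned."""
    target = np.zeros(x.shape[1]) if cond is None else cond
    return (target - x) / (1.0 - t)

def sample_latents(velocity_fn, cond, n, dim, cfg_scale=2.0, steps=50, seed=0):
    """Euler integration of the guided velocity field from t = 0 to t = 1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, dim))  # Gaussian noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_c = velocity_fn(x, t, cond)   # conditioned on the template embedding
        v_u = velocity_fn(x, t, None)   # condition dropped
        x = x + dt * guided_velocity(v_c, v_u, cfg_scale)
    return x

cond = np.full(8, 0.5)  # toy conditioning vector, not a real text embedding
plain = sample_latents(toy_velocity, cond, n=4, dim=8, cfg_scale=1.0)
guided = sample_latents(toy_velocity, cond, n=4, dim=8, cfg_scale=2.0)
```

With `cfg_scale=1.0` the guided velocity reduces to the conditional one, so sampling follows the conditional flow unchanged. With `cfg_scale=2.0` the sampler extrapolates past it, which in this toy field pushes samples further along the conditioning direction. The abstract's "25x chance" figure is likewise consistent with uniform chance over 69 cell types: 0.369 × 69 ≈ 25.5.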

Authors

  • Fu, Z.

Categories