Genome-wide methylome modeling via generative AI incorporating long- and short-range interactions.
Journal:
Science Advances
PMID:
40215314
Abstract
Using millions of methylation segments, we developed DiffuCpG, a generative artificial intelligence (AI) diffusion model designed to solve the critical challenge of missing data in high-throughput methylation technologies. DiffuCpG goes beyond conventional methods by leveraging both short-range interactions, including local DNA sequences and nearby CpGs along both the latitude and longitude of the dataset, and long-range interactions, including three-dimensional genome architecture and long-distance correlations, to comprehensively model the methylome. In extensive independent validations across different tissue types, cancers, and technologies (whole-genome bisulfite sequencing, enhanced reduced representation bisulfite sequencing, single-cell bisulfite sequencing, and methylation arrays), DiffuCpG demonstrated superior performance to previous methods in accuracy, scalability, and versatility. For an average bisulfite sequencing dataset, DiffuCpG can extend the original dataset by millions of additional CpGs. As an alternative application of generative AI, DiffuCpG addresses a key bottleneck in epigenetic research and will substantially benefit studies relying on high-throughput methylation data.
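To make the missing-data problem concrete, the following minimal sketch shows a sample-by-CpG methylation matrix with missing entries and a naive neighbor-mean fill that draws on both axes of the matrix. It is an illustrative baseline only, not DiffuCpG or any part of its implementation. The function names (`neighbor_context`, `naive_impute`), the `window` parameter, and the toy values are all assumptions introduced here, and the sketch ignores DNA sequence and long-range information entirely.

```python
import math

def neighbor_context(beta, sample, cpg, window=2):
    """Collect neighbors of beta[sample][cpg] along both matrix axes (hypothetical helper)."""
    row = beta[sample]
    lo, hi = max(0, cpg - window), min(len(row), cpg + window + 1)
    # Same sample, flanking CpGs within the window (target excluded).
    along_genome = [row[j] for j in range(lo, hi) if j != cpg]
    # Same CpG, every other sample.
    across_samples = [beta[i][cpg] for i in range(len(beta)) if i != sample]
    return along_genome, across_samples

def naive_impute(beta, window=2):
    """Fill each missing value with the mean of its observed neighbors.

    A simple stand-in baseline, NOT the diffusion model described above.
    """
    filled = [row[:] for row in beta]
    for s, row in enumerate(beta):
        for c, value in enumerate(row):
            if not math.isnan(value):
                continue
            genome, samples = neighbor_context(beta, s, c, window)
            observed = [v for v in genome + samples if not math.isnan(v)]
            if observed:  # leave the gap if no neighbor is observed
                filled[s][c] = sum(observed) / len(observed)
    return filled

# Toy matrix: rows are samples, columns are consecutive CpGs, NaN marks missing calls.
nan = math.nan
beta = [
    [0.9, 0.8, nan, 0.7],
    [0.1, nan, 0.2, 0.3],
    [0.9, 0.9, 0.8, nan],
]
imputed = naive_impute(beta, window=1)
```

Neighbors are always read from the original matrix rather than the partially filled copy, so the result does not depend on the order in which missing entries are visited.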