Genome-wide methylome modeling via generative AI incorporating long- and short-range interactions.

Journal: Science advances
PMID:

Abstract

Using millions of methylation segments, we developed DiffuCpG, a generative artificial intelligence (AI) diffusion model designed to solve the critical challenge of missing data in high-throughput methylation technologies. DiffuCpG goes beyond conventional methods by leveraging both short-range interactions, including nearby CpGs along both the latitude and longitude of the dataset and local DNA sequences, and long-range interactions, including three-dimensional genome architecture and long-distance correlations, to comprehensively model the methylome. In extensive independent validations across different tissue types, cancers, and technologies (whole-genome bisulfite sequencing, enhanced reduced representation bisulfite sequencing, single-cell bisulfite sequencing, and methylation arrays), DiffuCpG demonstrated superior accuracy, scalability, and versatility compared with previous methods. For an average bisulfite sequencing dataset, DiffuCpG can extend the original dataset by millions of additional CpGs. As an alternative application of generative AI, DiffuCpG addresses a key bottleneck in epigenetic research and will substantially benefit studies relying on high-throughput methylation data.

Authors

  • Fengyao Yan
    Department of Public Health and Sciences, University of Miami, Miami, FL 33126, USA.
  • Aristeidis G Telonis
    Department of Biochemistry and Molecular Biology, University of Miami Miller School of Medicine, Miami, FL 33136, USA.
  • Qin Yang
    State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, China; School of Physics and Optoelectronic Engineering, Yangtze University, Jingzhou 434023, China.
  • Limin Jiang
    School of Computer Science and Technology, Tianjin University, Tianjin 300350, China; School of Information and Electrical Engineering, Hebei University of Engineering, Handan 056038, China.
  • Francine E Garrett-Bakelman
    Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22908, USA.
  • Mikkael A Sekeres
    Leukemia Program, Department of Hematology and Medical Oncology, Cleveland Clinic, Cleveland, OH.
  • Valeria Santini
    AOU Careggi, University of Florence, Florence, Italy. Electronic address: valeria.santini@unifi.it.
  • Michele Ceccarelli
    Computational Biology-Genomic Research Center, ABBVIE, Redwood City, CA, USA. michele.ceccarelli@unina.it.
  • Neha Goel
    Sylvester Comprehensive Cancer Center, University of Miami Miller School of Medicine, Miami, FL 33136, USA.
  • Liliana Garcia-Martinez
    Sylvester Comprehensive Cancer Center, University of Miami Miller School of Medicine, Miami, FL 33136, USA.
  • Lluis Morey
    Sylvester Comprehensive Cancer Center, University of Miami Miller School of Medicine, Miami, FL 33136, USA.
  • Maria E Figueroa
    Sylvester Comprehensive Cancer Center, University of Miami Miller School of Medicine, Miami, FL 33136, USA.
  • Yan Guo
    State Key Laboratory of Pathogen and Biosecurity, Beijing 100071, China.