Generating realistic artificial human genomes using adversarial autoencoders.

Journal: NAR genomics and bioinformatics
Published Date:

Abstract

A publicly available human genome is both valuable to researchers and a risk for its donor. Many actors could exploit it to extract information about the donor's health or that of their relatives. Recent efforts have employed artificial intelligence models to simulate genomic data, aiming to create synthetic datasets with scientific merit while preserving patient anonymity. Challenges arise due to the vast amount of data that constitute a complete human genome and the computational resources required. We present a dimension reduction method that combines artificial intelligence with our knowledge of mutation association mechanisms. This approach enables processing large amounts of data without significant computational resources. Our genome segmentation follows chromosomal recombination hotspots, closely resembling mutation transmission mechanisms. Data from the 1000 Genomes Project are used to train variational autoencoders with a Wasserstein GAN to generate novel data in a two-step process. After optimizing our strategy, our pipeline can generate a simulated population meeting several essential criteria. The simulated genomes are diverse yet realistic: the newly generated combinations of mutations follow the linkage disequilibrium found in humans. Our pipeline does not reveal the genetic identity of any individual donor, as the synthesized genomes differ from the reference samples.
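
The abstract's three main ideas (segmentation at recombination hotspots, a linkage disequilibrium check, and a distance check against the reference samples) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the hotspot coordinates, the 0/1 haplotype encoding, and the use of r² and Hamming distance as the metrics are all hypothetical choices made for the example.

```python
from bisect import bisect_right


def segment_by_hotspots(positions, hotspots):
    """Group variant indices into segments delimited by hotspot coordinates.

    positions: sorted variant coordinates on one chromosome (assumed input).
    hotspots:  sorted recombination hotspot coordinates (assumed input).
    """
    segments = [[] for _ in range(len(hotspots) + 1)]
    for idx, pos in enumerate(positions):
        # The number of hotspots at or before pos is the segment index.
        segments[bisect_right(hotspots, pos)].append(idx)
    return [seg for seg in segments if seg]  # drop empty segments


def r_squared(a, b):
    """Linkage disequilibrium (r^2) between two 0/1 allele vectors."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0  # monomorphic site: r^2 undefined, reported as 0 here
    d = pab - pa * pb
    return d * d / denom


def min_hamming(sample, references):
    """Smallest number of differing sites between a sample and any reference."""
    return min(sum(x != y for x, y in zip(sample, ref)) for ref in references)


# Toy data: five variants, two hypothetical hotspots.
positions = [100, 250, 900, 1500, 1600]
hotspots = [500, 1200]
segments = segment_by_hotspots(positions, hotspots)

# Toy check that a synthetic haplotype is not a copy of a reference one.
references = [[0, 0, 1, 0, 1], [1, 1, 1, 1, 0]]
synthetic = [0, 1, 1, 0, 1]
distance = min_hamming(synthetic, references)
```

In this sketch each segment would be modelled separately, which is one way to read the dimension reduction described above. Comparing r² values between real and generated haplotypes, and requiring a non-zero minimum distance to every reference, are simple stand-ins for the realism and anonymity criteria.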

Authors

  • Callum Burnard
    Institut de Génétique Humaine, 34094 Montpellier, France.
  • Alban Mancheron
    Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier, 34095 Montpellier, France.
  • William Ritchie
    Institut de Génétique Humaine (IGH-UMR9002), Centre National de la Recherche Scientifique (CNRS), University of Montpellier, Montpellier, France. william.ritchie@igh.cnrs.fr.