Evolutionary transfer learning enables organism-wide inference of mammalian enhancer landscapes

Journal: bioRxiv
Published Date:

Abstract

Understanding and modeling how a single human genome concurrently encodes gene regulatory programs for thousands of cell types remains a central challenge in genomics and machine learning. Most human cell types emerge during embryonic, fetal, and pediatric development, stages that are inaccessible to comprehensive molecular profiling. To circumvent this, we hypothesized that the mismatch in evolutionary rates between cis-acting enhancers that modulate gene expression (fast) and the trans-acting regulatory factors that specify cell types (slow) creates an opportunity for 'evolutionary transfer learning'. Specifically, models trained to predict cell type-specific enhancers in one species should generalize to the orthologous cell types and enhancers of related species. To test this, we generated a single-cell atlas of chromatin accessibility spanning mouse embryonic day 10 (E10) to birth (P0). Using combinatorial indexing, we profiled 3.9 million nuclei from 36 staged embryos, resolving genome-wide accessibility in 36 cell classes and 140 cell types. We then trained a series of multi-output deep learning models (CREsted), each addressing limitations of the preceding approach, towards the goal of genome-wide prediction of distal enhancers across major developmental lineages. An 'evolution-naive' model achieved strong performance on held-out peaks, but exhibited two failure modes during genome-wide inference: overprediction at tandem repeats and conflation of promoter and distal enhancer grammars. An 'evolution-aware' model resolved these by regrouping accessible regions based on their retention and functional coherence across mammalian evolution, but failed to generalize across species. Finally, an 'evolution-augmented' model, STEAM (Synteny-aware Transfer learning for Enhancer Activity Modeling), incorporated enhancer orthologs from 241 mammalian genomes (Zoonomia) in a synteny-supervised manner. This increased the effective data scale by as much as 195-fold, markedly improving generalization across mammals despite greater label noise. We applied STEAM to the genome-wide inference of cell class-specific distal developmental enhancers in humans, mice (HumMus) and 239 additional mammals (BabaGanoush), i.e., 32 × 241 = 7,712 genome-wide distal enhancer prediction tracks. Together, our results unify advances in single-cell profiling, deep learning, and comparative genomics into a framework for the evolutionary transfer learning of noncoding regulatory grammars. More broadly, our work supports the view that model organisms and evolutionarily diverse genomes are indispensable resources for accelerating and enhancing the AI-enabled exploration of human biology. Note: An interactive version of this preprint, together with count matrices, CREsted models, prediction tracks, code and reproducible figures, is available at https://doi.org/10.62329/hxkk6249.
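
The data-scale and track-count arithmetic in the abstract can be illustrated with a minimal sketch. This is not the STEAM implementation: the `Region` type, the `orthologs` mapping, and the assumption that each ortholog simply inherits the labels of its mouse region are hypothetical simplifications, and the toy regions below are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    species: str
    region_id: str
    labels: tuple  # per-cell-class accessibility labels (hypothetical encoding)


def augment_with_orthologs(mouse_regions, orthologs):
    """Add one training example per retained ortholog.

    `orthologs` maps a mouse region_id to the species in which an ortholog
    was retained (a hypothetical input; the abstract does not specify how
    synteny supervision selects these). Each ortholog inherits the labels
    of its mouse region, which is where label noise can enter.
    """
    examples = list(mouse_regions)
    for region in mouse_regions:
        for species in orthologs.get(region.region_id, []):
            examples.append(Region(species, region.region_id, region.labels))
    return examples


def fold_increase(n_original, n_augmented):
    """Effective data scale relative to the single-species training set."""
    return n_augmented / n_original


def n_prediction_tracks(n_cell_classes, n_genomes):
    """One genome-wide track per (cell class, genome) pair."""
    return n_cell_classes * n_genomes


# Toy data, invented for illustration.
mouse = [Region("mouse", "r1", (1, 0)), Region("mouse", "r2", (0, 1))]
toy_orthologs = {"r1": ["human", "dog"], "r2": ["human"]}
augmented = augment_with_orthologs(mouse, toy_orthologs)

toy_fold = fold_increase(len(mouse), len(augmented))
tracks = n_prediction_tracks(32, 241)
```

With the toy inputs, two mouse regions grow to five training examples, a 2.5-fold increase; the track count for 32 cell classes across 241 genomes matches the 7,712 stated in the abstract.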

Authors

  • Qiu, C.
  • Daza, R. M.
  • Welsh, I. C.
  • Patwardhan, R. P.
  • Martin, B. K.
  • Li, T.
  • Yang, S.
  • Mannens, C. C. A.
  • De Winter, S.
  • Kempynck, N.
  • Taylor, M. L.
  • Fulton, O.
  • Le, T.-M.
  • O'Day, D. R.
  • Lalanne, J.-B.
  • Domcke, S.
  • Murray, S. A.
  • Aerts, S.
  • Trapnell, C.
  • Shendure, J.

Categories