Enformer-Based Phylogenetic Tree Reconstruction
Journal: bioRxiv
Published Date: Jun 6, 2026
Abstract
Enformer is a deep learning model trained on human and mouse genomes to predict regulatory activity from 196,608 bp DNA windows. Its trunk embeddings capture long-range cis-regulatory interactions, but whether this signal generalises across the tree of life has not been assessed. We embed universal single-copy orthologous groups (OGs) from OrthoDB v12 at three taxonomic scales and evaluate the reconstructed trees against TimeTree5 using the Mantel r and the Normalised Robinson-Foulds (NRF) distance. On 702 OGs across 34 Primate species (≤74 Mya), the consensus tree achieves Mantel r=0.902 and NRF=0.481, correctly recovering major clades. A key finding is that flanking regulatory context, not the gene locus itself, carries the phylogenetic signal: restricting pooling to the central 448 bins collapses Mantel r to 0.355. Applying the same fixed configuration to Vertebrates (≤450 Mya, 83 OGs, 150 species) and Plants (≤1,500 Mya, 92 OGs, 40 species) yields consensus Mantel r of 0.752 and 0.803, respectively, with NRF worsening monotonically across tiers. Distance-ordering fidelity degrades smoothly with evolutionary distance while topological accuracy declines steadily, with no sharp taxonomic boundary. These results show that an unmodified regulatory deep learning model encodes robust phylogenetic signal well beyond its training distribution, across up to 1,500 million years of divergence.
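The Mantel r used for evaluation can be illustrated with a minimal sketch: it is the Pearson correlation between the off-diagonal entries of two species-by-species distance matrices, here one derived from pooled embeddings and one from reference divergence times. The abstract does not state which embedding distance or which Mantel implementation was used, so the cosine distance, the permutation test, and all function names below (`cosine_distance_matrix`, `mantel_r`, `mantel_test`) are assumptions made for illustration only.

```python
import numpy as np


def cosine_distance_matrix(emb):
    """Pairwise cosine distance between rows (one pooled embedding per species).

    Assumption: cosine distance is used here for illustration; the abstract
    does not name the metric.
    """
    emb = np.asarray(emb, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def mantel_r(d1, d2):
    """Pearson correlation between the upper triangles of two distance matrices."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    if d1.ndim != 2 or d1.shape[0] != d1.shape[1] or d1.shape != d2.shape:
        raise ValueError("expected two square matrices of equal shape")
    iu = np.triu_indices(d1.shape[0], k=1)  # skip the diagonal and duplicates
    return float(np.corrcoef(d1[iu], d2[iu])[0, 1])


def mantel_test(d1, d2, n_perm=999, seed=0):
    """Observed Mantel r plus a one-sided permutation p-value.

    Species labels of d1 are shuffled (rows and columns together) to build
    the null distribution.
    """
    rng = np.random.default_rng(seed)
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    observed = mantel_r(d1, d2)
    n = d1.shape[0]
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        if mantel_r(d1[np.ix_(p, p)], d2) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

In use, `mantel_test(cosine_distance_matrix(embeddings), reference_distances)` would compare an embedding-derived matrix against a matrix of pairwise divergence times, provided both are ordered by the same species list. Because the statistic only reflects how well pairwise distances agree, it can stay high even when tree topology (as measured by NRF) is poor.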