Advancing ab initio genome annotation with OrionGeno
Journal: bioRxiv
Published Date: Apr 29, 2026
Abstract
The rapid expansion of eukaryotic genome sequencing has created an urgent demand for scalable and accurate gene annotation, particularly for large-scale genomic initiatives such as the Earth BioGenome Project (EBP). Existing ab initio methods often struggle with complex gene architectures and exhibit limited cross-lineage generalizability. Moreover, these frameworks typically treat repetitive DNA sequences (repeats) as genomic noise to be pre-masked, leaving the joint modeling of genes and repeats largely unexplored. Here we present OrionGeno, a multispecies phylogeny-aware deep learning framework for end-to-end eukaryotic genome annotation. By integrating phylogenetic context into model learning, OrionGeno resolves complex gene structure variations across divergent lineages, jointly predicting exon-intron architectures, UTRs, and repeats directly from genomic sequences. Across Vertebrates, Invertebrates, Viridiplantae and Fungi, OrionGeno consistently outperforms state-of-the-art methods, achieving a 37.2% relative improvement in protein-level F1 score over the existing best-performing method. Beyond benchmarking, OrionGeno identifies novel loci within well-curated model genomes and generates high-confidence annotations for ~1,200 previously uncharacterized species, expanding NCBI's family-level coverage by 40.5%. As an evidence-independent approach, OrionGeno bridges the gap between genome sequencing and functional discovery, holding promise for large-scale biodiversity initiatives like the EBP.
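The headline figure above is a relative, not absolute, gain in F1. As a minimal sketch of that distinction, the snippet below computes F1 from match counts and then the relative improvement of one score over another. The function names and all counts are hypothetical and chosen for easy arithmetic. They are not taken from the paper, and the paper's exact protocol for matching predicted proteins to reference proteins is not specified in this abstract.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new: float, baseline: float) -> float:
    """Fractional gain of `new` over `baseline` (0.25 means 25% relative)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# Hypothetical counts for illustration only (NOT results from the paper).
baseline_f1 = f1_score(true_positives=50, false_positives=50, false_negatives=50)
new_f1 = f1_score(true_positives=75, false_positives=25, false_negatives=25)
gain = relative_improvement(new_f1, baseline_f1)

print(f"baseline F1 = {baseline_f1:.2f}, new F1 = {new_f1:.2f}")
print(f"absolute gain = {new_f1 - baseline_f1:.2f}, relative gain = {gain:.1%}")
```

In this toy case the absolute gain is 0.25 F1 points while the relative gain is 50%, which is why a relative figure should be read alongside the baseline score it is measured against.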