Advancing ab initio genome annotation with OrionGeno

Journal: bioRxiv
Published Date:

Abstract

The rapid expansion of eukaryotic genome sequencing has created an urgent demand for scalable and accurate gene annotation, particularly for large-scale genomic initiatives such as the Earth BioGenome Project (EBP). Existing ab initio methods often struggle with complex gene architectures and exhibit limited cross-lineage generalizability. Moreover, these frameworks typically treat repetitive DNA sequences (repeats) as genomic noise to be pre-masked, leaving the joint modeling of genes and repeats largely unexplored. Here we present OrionGeno, a multispecies phylogeny-aware deep learning framework for end-to-end eukaryotic genome annotation. By integrating phylogenetic context into model learning, OrionGeno resolves complex gene structure variations across divergent lineages, jointly predicting exon-intron architectures, UTRs, and repeats directly from genomic sequences. Across Vertebrates, Invertebrates, Viridiplantae and Fungi, OrionGeno consistently outperforms state-of-the-art methods, achieving a 37.2% relative improvement in protein-level F1 score over the existing best-performing method. Beyond benchmarking, OrionGeno identifies novel loci within well-curated model genomes and generates high-confidence annotations for ~1,200 previously uncharacterized species, expanding NCBI's family-level coverage by 40.5%. As an evidence-independent approach, OrionGeno bridges the gap between genome sequencing and functional discovery, holding promise for large-scale biodiversity initiatives like the EBP.

Authors

  • Liu, L.
  • Cai, X.
  • Wang, S.
  • Deng, Y.
  • Wu, Y.
  • Pan, Y.
  • Wang, J.
  • Zhang, C.
  • Xia, H.
  • Tan, N.
  • Su, K.
  • Liu, Y.
  • Zhou, X.
  • Liu, L.
  • Wei, T.
  • Zhang, Y.
  • Li, Q.
  • Li, Y.
  • Yin, P.
  • Xu, X.

Categories