geneML: Gene annotation across diverse fungal species using deep learning

Journal: bioRxiv
Published Date:

Abstract

Accurate gene prediction remains a major bottleneck in fungal genomics, where lineage diversity and alternative splicing challenge existing ab initio methods. Here, we present geneML, a deep learning-based gene prediction tool tailored to fungal genomes. Across nine reference genomes spanning diverse fungal taxa, geneML achieved a gene-level F1 score of 67.1, compared with 64.9 for BRAKER3 with protein-based hints, driven by substantially higher recall (69.0 vs. 64.1) at equivalent precision. geneML also remains fast, averaging around 6 minutes per genome on a standard 8-core CPU. A key feature of geneML is its ability to predict alternative transcripts. Evaluated against Fusarium graminearum Iso-Seq control data, it achieves 41.1% transcript recall and 71.1% precision, outperforming AUGUSTUS (33.8% recall, 48.9% precision), one of the few tools that support isoform prediction. The predicted transcript diversity is consistent with experimentally observed fungal alternative splicing patterns. Reannotation of the curated training dataset further suggests improved biological completeness, with geneML recovering 15.3% more genes containing complete Pfam domains than the reference annotation. These results demonstrate that geneML enables faster, more sensitive, and more biologically informative fungal genome annotation. geneML is available as an open-source command-line tool at https://github.com/hexagonbio/geneML.
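
The abstract quotes F1 and recall for both tools but not the precision values behind the phrase "equivalent precision". As a rough consistency check, the sketch below inverts the standard definition F1 = 2PR / (P + R) to recover the precision implied by each reported pair. It assumes the quoted F1 and recall come from the same aggregate counts; if the paper averages per-genome scores across the nine genomes, the identity holds only approximately. The function names are illustrative and are not part of geneML.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("no valid precision for this F1/recall pair")
    return f1 * recall / denominator


# Gene-level figures quoted in the abstract (percent scale).
# Assumption: F1 and recall derive from the same aggregate counts.
reported = {
    "geneML": {"f1": 67.1, "recall": 69.0},
    "BRAKER3": {"f1": 64.9, "recall": 64.1},
}

implied = {
    tool: implied_precision(values["f1"], values["recall"])
    for tool, values in reported.items()
}

for tool, precision in implied.items():
    print(f"{tool}: implied precision = {precision:.1f}")
```

Under that assumption the implied precisions come out at roughly 65.3 for geneML and 65.7 for BRAKER3, which is in line with the abstract's statement that the F1 gain is driven by recall rather than precision.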

Authors

  • Vader, L.
  • Harvey, C. J.
  • Weber, T.
  • Hon, L. S.

Categories