GeneCAD: Plant Genome Annotation with a DNA Foundation Model
Journal: bioRxiv
Published Date: Jan 1, 2025
Abstract
Accurate genome annotation remains a bottleneck in plants, where polyploidy and repeat-rich sequence confound homology- and RNA-based pipelines. We introduce GeneCAD, a sequence-only method that predicts complete plant gene models directly from DNA. GeneCAD couples representations from a plant DNA foundation model, PlantCAD2, with a lightweight ModernBERT encoder and a chromosome-wide conditional random field that enforces splice phase and feature order. It then applies a protein language-model screen to suppress repeat-driven open reading frames. To limit label noise, we rank and filter public annotations using a sequence-based masked-motif score and fine-tune on five phylogenetically diverse, high-quality references. Across five held-out angiosperms, including the allotetraploid Nicotiana tabacum, GeneCAD improves transcript-level F1 by 8–10% on average over Helixer and BRAKER3, increases the number of exact-match transcripts, and sharpens boundaries at start/stop codons and splice junctions. By removing dependence on species-matched RNA-seq or proteomics while retaining cross-species accuracy, GeneCAD provides an accurate, scalable route to biologically coherent plant gene models from DNA alone.
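To make the idea of a decoder that enforces feature order and splice phase more concrete, the following is a minimal sketch and not GeneCAD's implementation. It assumes a simplified, hypothetical state set (intergenic, three coding phases, three phase-remembering intron states) and replaces a learned conditional random field with plain Viterbi decoding over a hard transition mask. The names `STATES`, `build_transition_mask`, and `viterbi` are illustrative only, and the per-base scores are made up for the demonstration.

```python
import numpy as np

# Hypothetical, simplified label set: cdsP = coding base at codon position P,
# intronQ = intron base that remembers the next coding base is at position Q.
STATES = ["intergenic", "cds0", "cds1", "cds2", "intron0", "intron1", "intron2"]
IDX = {s: i for i, s in enumerate(STATES)}


def build_transition_mask():
    """Return an (S, S) matrix: 0.0 for allowed transitions, -inf otherwise."""
    n = len(STATES)
    mask = np.full((n, n), -np.inf)

    def allow(a, b):
        mask[IDX[a], IDX[b]] = 0.0

    allow("intergenic", "intergenic")
    allow("intergenic", "cds0")   # coding region starts at codon position 0
    allow("cds2", "intergenic")   # coding region ends after a complete codon
    for p in range(3):
        nxt = (p + 1) % 3
        allow(f"cds{p}", f"cds{nxt}")       # advance within a codon
        allow(f"cds{p}", f"intron{nxt}")    # enter intron, remember phase
        allow(f"intron{p}", f"intron{p}")   # stay in intron, phase unchanged
        allow(f"intron{p}", f"cds{p}")      # resume coding at remembered phase
    return mask


def viterbi(emissions, mask):
    """Best label path for (T, S) log-scores under the transition mask.

    The path is forced to start and end in the intergenic state.
    """
    T, n = emissions.shape
    score = np.full(n, -np.inf)
    score[IDX["intergenic"]] = emissions[0, IDX["intergenic"]]
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + mask          # cand[i, j]: come from i, go to j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    state = IDX["intergenic"]
    path = [state]
    for t in range(T - 1, 0, -1):
        state = back[t, state]
        path.append(state)
    return [STATES[i] for i in reversed(path)]


# Toy scores: positions 2..5 look coding (4 bases, not a multiple of 3).
emissions = np.full((10, len(STATES)), -5.0)
emissions[:, IDX["intergenic"]] = 0.0
emissions[2:6, IDX["intergenic"]] = -5.0
emissions[2:6, 1:4] = 0.0
path = viterbi(emissions, build_transition_mask())
```

In this toy input the raw per-base scores favor a four-base coding stretch, which cannot be a whole number of codons. Because the mask only lets a coding region close after codon position 2, the decoded `path` instead contains a coding length that is a multiple of three. A real system would learn transition scores and use a richer state set (UTRs, strand, start/stop signals).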