OryzaG3: A Single-species Genomic Foundation Model Pretrained on Rice Pangenome

Journal: bioRxiv
Published Date:

Abstract

While multi-species genomic language models have advanced biological representation learning, high-quality, single-species foundation models for crops remain scarce. Leveraging recently expanded rice pangenome resources, we introduce OryzaG3, a species-specific DNA language model with 700M parameters. OryzaG3 was pretrained on 59.20 Gb of chromosome-level sequences from 149 high-quality rice genomes using a non-overlapping 3-mer tokenization strategy and a causal language modeling objective, featuring context-length variants up to 32k tokens. On the Plants Genomic Benchmark polyA prediction task, OryzaG3 achieves competitive predictive performance against leading multi-species models while delivering a four-fold increase in inference throughput under identical long-context conditions. Ultimately, OryzaG3 demonstrates that lightweight, single-species foundation models trained on high-quality pangenomes can match multi-species benchmarks while significantly reducing computational overhead. This work provides a scalable framework for rice functional genomics, molecular breeding, and targeted crop foundation model development.

Authors

  • Yang
  • L.; Xia
  • Y.; Yang
  • Z.; Xia
  • C.; Wu
  • T.; Zou
  • M.; Xia
  • Z.

Categories