Decoding Prokaryotic Whole Genomes with a Product-Contextualized Large Language Model

Journal: bioRxiv
Published Date:

Abstract

Genomes encode the instructions for life, yet their full interpretation requires models capable of capturing long-range context and functional meaning at scale. Existing genome language models (gLMs) are limited by short context windows, high computational cost, and poor interpretability. We present GenSyntax, a product-contextualized large language model (LLM) trained on 49,250 annotated prokaryotic genomes. GenSyntax replaces nucleotide tokenization with gene product descriptors, transforming genomes into “genetic paragraphs” that preserve functional semantics. Using a two-stage training strategy, GenSyntax outperforms other LLMs in plasmid host identification, gene function prediction, genome assembly, and gene essentiality assessment. It also enables phenotype prediction and minimal genome design, establishing a scalable and interpretable framework for genome-scale decoding and synthetic biology.
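
To make the idea of a “genetic paragraph” concrete, the following is a minimal Python sketch of how annotated genes might be turned into a sequence of product descriptors in genome order. It is illustrative only: the `Gene` record, the `genetic_paragraph` function, the "; " separator, and the example annotations are all assumptions made here, not the serialization format used by GenSyntax.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    """Hypothetical annotation record; field names are assumptions."""
    start: int    # start coordinate on the replicon
    strand: str   # "+" or "-"
    product: str  # gene product descriptor from the annotation


def genetic_paragraph(genes, sep="; "):
    """Join product descriptors in genomic order into one text string.

    The separator is an arbitrary choice for this sketch.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    # Skip genes whose product field is empty or whitespace only.
    return sep.join(g.product.strip() for g in ordered if g.product.strip())


# Made-up example annotations, deliberately out of order.
genes = [
    Gene(1200, "+", "DNA polymerase III subunit beta"),
    Gene(100, "+", "chromosomal replication initiator protein DnaA"),
    Gene(2500, "-", "hypothetical protein"),
]
paragraph = genetic_paragraph(genes)
print(paragraph)
```

Under this representation each gene contributes a short descriptor rather than its full nucleotide sequence, which is how a whole genome could plausibly fit within a language model's context window.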

Authors

  • Shiwen Ni; Shuaimin Li; Shijian Wang; Xinping Bi; Yitai Li; Chengguang Gan; Jiarui Jin; Yuan Lu; Ahmadreza Argha; Hamid Alinejad-Rokny; Tong Si; Min Yang; Teng Wang