Decoding Prokaryotic Whole Genomes with a Product-Contextualized Large Language Model
Journal: bioRxiv
Published Date: Jan 1, 2025
Abstract
Genomes encode the instructions for life, yet interpreting them fully requires models that can capture long-range context and functional meaning at scale. Existing genome language models (gLMs) are limited by short context windows, high computational cost, and poor interpretability. We present GenSyntax, a product-contextualized large language model (LLM) trained on 49,250 annotated prokaryotic genomes. GenSyntax replaces nucleotide tokenization with gene product descriptors, transforming genomes into “genetic paragraphs” that preserve functional semantics. Using a two-stage training strategy, GenSyntax achieves leading performance compared with other LLMs in plasmid host identification, gene function prediction, genome assembly, and gene essentiality assessment. It also enables phenotype prediction and minimal genome design, establishing a scalable and interpretable framework for genome-scale decoding and synthetic biology.
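To make the idea of a “genetic paragraph” concrete, the following is a minimal sketch of how an annotated genome might be turned into text built from gene product descriptors rather than nucleotides. The abstract does not specify the actual format, so everything here is an assumption made for illustration: the `Gene` record, the `genetic_paragraph` function, the ordering by start coordinate, and the `"; "` separator are all hypothetical and are not GenSyntax's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical annotation record: one gene and its product descriptor."""
    start: int      # start coordinate on the replicon (assumed field)
    product: str    # product descriptor, e.g. "DNA gyrase subunit B"


def genetic_paragraph(genes, sep="; "):
    """Join product descriptors in genomic order into a single text string.

    Illustrative only: the separator and ordering rule are assumptions,
    not the format used by GenSyntax.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    # Collapse internal whitespace so each descriptor is a clean phrase.
    return sep.join(" ".join(g.product.split()) for g in ordered)


if __name__ == "__main__":
    genes = [
        Gene(start=1200, product="DNA gyrase subunit B"),
        Gene(start=100, product="chromosomal replication initiator protein DnaA"),
        Gene(start=700, product="hypothetical  protein"),
    ]
    print(genetic_paragraph(genes))
```

Under these assumptions, gene order is preserved while each gene contributes one short phrase instead of a long run of nucleotide tokens, which suggests how a product-level representation could fit far more genomic context into a fixed context window.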