Functional In-Context Learning in Genomic Language Models with Nucleotide-Level Supervision and Genome Compression

Journal: bioRxiv
Published Date:

Abstract

Genomic foundation models aim to learn general-purpose representations directly from DNA sequence, enabling sequence understanding, generation, and probabilistic reasoning across a wide range of biological tasks. Scaling such models to genomic lengths, however, remains challenging due to the tension between long-range context, nucleotide-level resolution, and practical computational efficiency. Architectural innovations have enabled increasingly long nominal inputs, but the resulting models often struggle to translate additional context into meaningful performance gains, particularly in the presence of sparse functional signal along eukaryotic genomes. In this work, we revisit the design of long-context genomic foundation models from the perspective of training objective and data construction. We introduce Factorized Nucleotide Supervision (FNS), which reconciles efficient k-mer tokenization with single-nucleotide likelihoods through probability marginalization, and Genome Compression Pretraining (GCP), which reshapes the training distribution by concentrating on gene-centric and regulatory regions. Together, these techniques enable standard transformer-based models to perform functional in-context learning without sacrificing nucleotide-level fidelity or computational efficiency. Building on these ideas, we present a family of autoregressive genomic foundation models supporting contexts of up to 98k base pairs across eukaryotic and prokaryotic genomes. Across training-free evaluations and downstream fine-tuning benchmarks, our models consistently improve over prior approaches and match or exceed state-of-the-art baselines while enabling substantially more efficient inference. These results demonstrate that aligning supervision and data regimes with the biological structure of genomic sequence provides a principled and effective path toward scalable and biologically faithful genomic language modeling.
Models, data, and scripts for downstream analyses are publicly available at https://huggingface.co/GenerTeam.
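
The abstract describes FNS only as reconciling k-mer tokenization with single-nucleotide likelihoods through probability marginalization. As a rough illustration of that general idea, and not the authors' implementation, the sketch below sums a distribution over k-mers into per-position nucleotide marginals. The function names, the toy value k = 2, and the example weights are all assumptions made for illustration.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def kmer_vocab(k):
    """Enumerate all 4**k k-mers over the DNA alphabet (hypothetical helper)."""
    return ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]


def marginalize(kmer_probs, k):
    """Turn a distribution over k-mers into k per-position nucleotide marginals.

    kmer_probs maps each k-mer to its probability (assumed to sum to 1).
    Returns a list of k dicts, one per position inside the k-mer.
    """
    marginals = [{n: 0.0 for n in NUCLEOTIDES} for _ in range(k)]
    for kmer, p in kmer_probs.items():
        for pos, nuc in enumerate(kmer):
            # P(x_pos = nuc) is the total mass of all k-mers with nuc at pos.
            marginals[pos][nuc] += p
    return marginals


# Toy stand-in for a model's next-token distribution: uniform weights,
# with extra mass placed on the 2-mer "AC" (an illustrative assumption).
k = 2
weights = {kmer: 1.0 for kmer in kmer_vocab(k)}
weights["AC"] = 5.0
total = sum(weights.values())
kmer_probs = {kmer: w / total for kmer, w in weights.items()}

marginals = marginalize(kmer_probs, k)
for pos, dist in enumerate(marginals):
    print(pos, {n: round(p, 3) for n, p in dist.items()})
```

A training loss could then, in principle, be computed on these single-nucleotide marginals rather than on the k-mer token itself; whether FNS does exactly this is not stated in the abstract.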

Authors

  • Li, Q.
  • Zhan, Z.
  • Feng, S.
  • Zhu, Y.
  • He, Y.
  • Wu, W.
  • Shi, Z.
  • Wang, S.
  • Hu, Z.
  • Yang, Z.
  • Li, J.
  • Tang, J.
  • Liu, H.
  • Qin, T.

Categories