Effects of Structural Reward Shaping on Biophysical Properties in RL-Trained Plasmid Generators

Journal: bioRxiv
Published Date:

Abstract

We compare the efficacy and distributional effects of supervised fine-tuning (SFT) and reinforcement learning (RL) post-training for PlasmidGPT, a foundation model for whole-plasmid generation, using Group Relative Policy Optimization (GRPO) for the RL model. Using a biologically motivated reward function encoding functional annotations, length constraints, and repeat penalties, the RL model achieves a 71.6% quality control pass rate across 8 prompts on 4,000 sequences, compared to 4.3% for the pretrained baseline and 11.0% for SFT. A five-model reward ablation identifies the cassette arrangement bonus, which rewards correct promoter→CDS→terminator ordering, as the critical reward component. Rejection-sampling baselines indicate that the gain is not recovered by sampling more heavily from the base model. Beyond directly optimized features, RL-generated sequences converge toward real plasmid distributions in 3-mer composition, ORF length, and thermodynamic stability, properties we categorize as reward-correlated or indirectly shaped by the structural reward signal. Minimum free energy density independently converges to the real-plasmid regime under both SFT and RL despite these being parallel post-training paths. On a small curated hold-out set, RL improves continuation log-likelihood over the pretrained baseline on every sequence (mean Δ = +0.83 nats), with no degradation in next-token prediction.
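As a rough illustration of two ideas named in the abstract, the sketch below shows one way a cassette arrangement bonus and a GRPO-style group-relative advantage could be computed. It is not the authors' implementation: the annotation format (feature type and start coordinate), the function names `cassette_bonus` and `group_relative_advantages`, the bonus value, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev

# Hypothetical annotation format: (feature_type, start_position) tuples.
# The abstract does not specify how annotations are represented.
WANTED_ORDER = ("promoter", "CDS", "terminator")


def cassette_bonus(features, bonus=1.0):
    """Return `bonus` if a promoter, then a CDS, then a terminator
    appear in that order along the sequence; otherwise return 0.0."""
    ordered = sorted(features, key=lambda f: f[1])  # sort by start position
    stage = 0  # index into WANTED_ORDER of the next feature we need
    for kind, _start in ordered:
        if kind == WANTED_ORDER[stage]:
            stage += 1
            if stage == len(WANTED_ORDER):
                return bonus
    return 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt.
    Assumes population standard deviation; `eps` avoids division by zero."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group of four generations for one prompt (made-up annotations).
group = [
    [("promoter", 100), ("CDS", 500), ("terminator", 1500)],
    [("terminator", 100), ("CDS", 500), ("promoter", 1500)],
    [("CDS", 200), ("terminator", 900)],
    [("CDS", 500), ("promoter", 100), ("terminator", 1500)],
]
rewards = [cassette_bonus(f) for f in group]
advantages = group_relative_advantages(rewards)
```

In this toy group, only generations with a correctly ordered cassette receive the bonus, so they end up with positive advantages relative to the others. A full reward would also need the annotation, length, and repeat terms mentioned in the abstract, which are not sketched here.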

Authors

  • Thiel, M.
  • Cunningham, A.
  • Barnes, C. P.

Categories