Predicting function of evolutionarily implausible DNA sequences
Journal: arXiv
Published Date: Jun 12, 2025
Abstract
Genomic language models (gLMs) show potential for generating novel,
functional DNA sequences for synthetic biology, but doing so requires them to
learn not just evolutionary plausibility, but also sequence-to-function
relationships. We introduce Nullsettes, a set of prediction tasks that assess
a model's ability to predict loss-of-function mutations created by
translocating key control elements in synthetic expression cassettes. Across 12
state-of-the-art models, we find that mutation effect prediction performance
strongly correlates with the predicted likelihood of the nonmutant.
Furthermore, the range of likelihood values predictive of strong model
performance is highly dependent on sequence length. Our work highlights the
importance of considering both sequence likelihood and sequence length when
using gLMs for mutation effect prediction.
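
As a rough illustration of the setup described above, the following minimal sketch builds a translocation mutant of a toy expression cassette and compares its log-likelihood with that of the nonmutant. Everything here is an assumption made for illustration: the part names and placeholder sequences in `CASSETTE`, the helper functions, and the first-order Markov model, which merely stands in for a gLM. It is not the Nullsettes benchmark or any of the 12 evaluated models.

```python
import math
from collections import Counter

# Hypothetical cassette: ordered (name, sequence) parts.
# The sequences are toy placeholders, not real regulatory elements.
CASSETTE = [
    ("promoter", "TATAAT"),
    ("rbs", "AGGAGG"),
    ("cds", "ATGGCTAAATAA"),
    ("terminator", "GCCGCC"),
]


def assemble(parts):
    """Concatenate the parts into one DNA string."""
    return "".join(seq for _, seq in parts)


def translocate(parts, name, new_index):
    """Return a copy of parts with the named element moved.

    new_index is the position in the list after the element is removed.
    """
    moved = list(parts)
    old_index = [n for n, _ in moved].index(name)
    element = moved.pop(old_index)
    moved.insert(new_index, element)
    return moved


def fit_markov(seq, alphabet="ACGT"):
    """Fit a first-order Markov model with add-one smoothing.

    Assumption: this toy model is only a stand-in for a real gLM.
    """
    counts = Counter(zip(seq, seq[1:]))
    model = {}
    for a in alphabet:
        total = sum(counts[(a, b)] for b in alphabet) + len(alphabet)
        for b in alphabet:
            model[(a, b)] = math.log((counts[(a, b)] + 1) / total)
    return model


def log_likelihood(seq, model):
    """Sum of log transition probabilities (first base taken as given)."""
    return sum(model[(a, b)] for a, b in zip(seq, seq[1:]))


wild_type = assemble(CASSETTE)
# Move the promoter to the end of the cassette, downstream of the terminator.
mutant = assemble(translocate(CASSETTE, "promoter", len(CASSETTE) - 1))

model = fit_markov(wild_type)
ll_wt = log_likelihood(wild_type, model)
ll_mut = log_likelihood(mutant, model)

# Log-likelihood ratio of mutant versus nonmutant.
llr = ll_mut - ll_wt
# Per-base average, one simple way to compare sequences of different length.
mean_ll_wt = ll_wt / len(wild_type)

print(f"LLR (mutant - nonmutant): {llr:.3f}")
print(f"Mean per-base log-likelihood of nonmutant: {mean_ll_wt:.3f}")
```

A translocation keeps the length and base composition of the cassette unchanged, so any difference in score comes from the order of elements alone. The per-base average is shown only as one simple normalization; the abstract does not say how likelihood and length should be combined.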