CodonTransformer: a multispecies codon optimizer using context-aware neural networks.

Journal: Nature Communications
PMID:

Abstract

Degeneracy in the genetic code allows many possible DNA sequences to encode the same protein. Optimizing codon usage within a sequence to meet organism-specific preferences faces a combinatorial explosion. Nevertheless, natural sequences optimized through evolution provide a rich source of data for machine learning algorithms to learn the underlying rules. Here, we introduce CodonTransformer, a multispecies deep learning model trained on over 1 million DNA-protein pairs from 164 organisms spanning all domains of life. The model demonstrates context-awareness thanks to its Transformer architecture and to our sequence representation strategy that combines organism, amino acid, and codon encodings. CodonTransformer generates host-specific DNA sequences with natural-like codon distribution profiles and with minimal negative cis-regulatory elements. This work introduces the strategy of Shared Token Representation and Encoding with Aligned Multi-masking (STREAM) and provides a codon optimization framework with a customizable open-access model and a user-friendly Google Colab interface.
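The representation strategy described above (combining organism, amino acid, and codon information into a single token stream) can be sketched as follows. This is a minimal illustration of the idea as read from the abstract; the token formats, the organism token convention, and the `CODON_TO_AA` helper table are assumptions for demonstration, not the authors' published implementation:

```python
from typing import List

# Small codon-table subset for the demo (hypothetical helper data).
CODON_TO_AA = {"ATG": "M", "AAA": "K", "TGA": "*"}

def stream_tokens(organism: str, codons: List[str]) -> List[str]:
    """Merge organism, amino-acid, and codon encodings into one token
    stream: an organism conditioning token followed by shared
    amino-acid+codon tokens (one per codon position)."""
    tokens = [f"[{organism}]"]                  # organism conditioning token
    for codon in codons:
        aa = CODON_TO_AA[codon]                 # amino acid encoded by this codon
        tokens.append(f"{aa}_{codon.lower()}")  # shared aa+codon token
    return tokens

print(stream_tokens("E.coli", ["ATG", "AAA", "TGA"]))
# → ['[E.coli]', 'M_atg', 'K_aaa', '*_tga']
```

Such shared tokens let a single vocabulary express both the protein-level constraint (the amino acid) and the DNA-level choice (the codon), which is what allows masking at the codon level while keeping the amino acid visible to the model.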

Authors

  • Adibvafa Fallahpour
    Vector Institute for Artificial Intelligence, Toronto, ON, Canada.
  • Vincent Gureghian
    Sorbonne Université, CNRS, ERL U1338 Inserm, Department of Computational, Quantitative and Synthetic Biology, Paris, France.
  • Guillaume J Filion
    Genome Architecture, Gene Regulation, Stem Cells and Cancer Programme, Centre for Genomic Regulation (CRG), Barcelona, Spain.
  • Ariel B Lindner
    Sorbonne Université, CNRS, ERL U1338 Inserm, Department of Computational, Quantitative and Synthetic Biology, Paris, France. ariel.lindner@inserm.fr.
  • Amir Pandi
    Micalis Institute, INRA, AgroParisTech, Université Paris-Saclay, Jouy-en-Josas, France.