Single-sequence protein structure prediction using a language model and deep learning.

Journal: Nature Biotechnology
Published Date:

Abstract

AlphaFold2 and related computational systems predict protein structure using deep learning and co-evolutionary relationships encoded in multiple sequence alignments (MSAs). Despite the high prediction accuracy achieved by these systems, challenges remain in (1) prediction of orphan and rapidly evolving proteins for which an MSA cannot be generated; (2) rapid exploration of designed structures; and (3) understanding the rules governing spontaneous polypeptide folding in solution. Here we report development of an end-to-end differentiable recurrent geometric network (RGN2) that uses a protein language model (AminoBERT) to learn latent structural information from unaligned proteins. A linked geometric module compactly represents Cα backbone geometry in a translationally and rotationally invariant way. On average, RGN2 outperforms AlphaFold2 and RoseTTAFold on orphan proteins and classes of designed proteins while achieving up to a 10-fold reduction in compute time. These findings demonstrate the practical and theoretical strengths of protein language models relative to MSAs in structure prediction.
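
The abstract does not spell out how the geometric module achieves invariance, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a Cα trace can be described by internal coordinates (consecutive Cα-Cα distances, bond angles, and torsion angles) and checks numerically that these values do not change when the whole trace is rotated and translated. All function names and the example coordinates are hypothetical.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def bond_angle(p0, p1, p2):
    """Angle at p1 formed by three consecutive points, in radians."""
    u, v = sub(p0, p1), sub(p2, p1)
    c = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp against round-off

def torsion(p0, p1, p2, p3):
    """Signed dihedral angle defined by four consecutive points, in radians."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    y = norm(b2) * dot(b1, cross(b2, b3))
    x = dot(cross(b1, b2), cross(b2, b3))
    return math.atan2(y, x)

def internal_coordinates(trace):
    """Distances, bond angles, and torsions along a trace of 3D points."""
    n = len(trace)
    lengths = [norm(sub(trace[i + 1], trace[i])) for i in range(n - 1)]
    angles = [bond_angle(*trace[i:i + 3]) for i in range(n - 2)]
    torsions = [torsion(*trace[i:i + 4]) for i in range(n - 3)]
    return lengths + angles + torsions

def rigid_motion(p, theta, phi, shift):
    """Rotate p about the z axis by theta, then about the x axis by phi, then translate."""
    x, y, z = p
    x, y = (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))
    y, z = (y * math.cos(phi) - z * math.sin(phi),
            y * math.sin(phi) + z * math.cos(phi))
    return (x + shift[0], y + shift[1], z + shift[2])

# Made-up coordinates standing in for five consecutive C-alpha atoms.
trace = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0),
         (8.0, 4.0, 2.3), (9.0, 7.5, 3.0)]
moved = [rigid_motion(p, 0.7, -1.2, (10.0, -4.0, 2.5)) for p in trace]

before = internal_coordinates(trace)
after = internal_coordinates(moved)
# Largest change in any internal coordinate; expected to be at round-off level.
print(max(abs(a - b) for a, b in zip(before, after)))
```

The Cartesian coordinates in `moved` differ from those in `trace`, yet the internal coordinates agree up to floating-point error, which is the sense in which such a representation is translationally and rotationally invariant.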

Authors

  • Ratul Chowdhury
    Department of Chemical Engineering, The Pennsylvania State University, University Park, PA 16802.
  • Nazim Bouatta
    Laboratory of Systems Pharmacology, Program in Therapeutic Science, Harvard Medical School, Boston, MA, USA. nazim_bouatta@hms.harvard.edu.
  • Surojit Biswas
    Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, USA.
  • Christina Floristean
    Department of Computer Science, Columbia University, New York, NY, USA.
  • Anant Kharkar
    Department of Computer Science, Columbia University, New York, NY, USA.
  • Koushik Roy
    Department of Computer Science, Columbia University, New York, NY, USA.
  • Charlotte Rochereau
    Integrated Program in Cellular, Molecular, and Biomedical Studies, Columbia University, New York, NY, USA.
  • Gustaf Ahdritz
    Department of Systems Biology, Columbia University, New York, NY, USA.
  • Joanna Zhang
    Department of Computer Science, Columbia University, New York, NY, USA.
  • George M Church
    Wyss Institute for Biologically Inspired Engineering, Boston, Massachusetts 02115, United States.
  • Peter K Sorger
    Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA. peter_sorger@hms.harvard.edu.
  • Mohammed AlQuraishi
    Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA; Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA. Electronic address: alquraishi@hms.harvard.edu.