Deciphering functional dark matter: Machine and deep learning-based processing of protein embeddings enables targeted function discoveries

Journal: bioRxiv
Published Date:

Abstract

The ever-expanding catalogue of uncharacterized proteins, the so-called functional dark matter, poses a major challenge for biotechnological and biomedical exploitation. Functional assessment of most proteins is hindered by the technical limitations of annotation transfer and by the propagation of erroneous annotations in databases. The common denominator is the reliance on sequence similarity. However, similarity-based inference becomes inaccurate below certain thresholds, and functions can diverge even at sequence identities around 70%. To address this challenge, we implemented a strategy that uses embeddings generated by protein language models for targeted function discovery (PE-TFD). Datasets of proteins representing target and non-target functions were used to train supervised learning models. The resulting ensemble models yielded interpretable prediction scores, enabling the exploration of databases without relying on multiple sequence alignments or structural information. As a proof of concept, we applied PE-TFD to hydrogenases and discovered 773 novel [NiFe] and 1,929 novel [FeFe] hydrogenases that were not detected by established sequence- or profile-based approaches. Structural analyses supported their non-random nature and further revealed a significant number of enzymes lacking prior functional annotation. Our framework therefore enables interpretable function discovery in large-scale datasets and the exploitation of functional dark matter.
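
The workflow outlined in the abstract (embeddings of target and non-target proteins, supervised models, an ensemble score) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the synthetic Gaussian vectors stand in for real protein language model embeddings, and the bagged logistic regression, the 0.5 cutoff, and names such as `train_ensemble` and `candidate_A` are hypothetical and not taken from the preprint, which does not specify its models here.

```python
import math
import random


def make_embeddings(n, dim, shift, rng):
    """Synthetic stand-ins for protein language model embeddings (assumption)."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict(model, x):
    """Score in [0, 1]: estimated probability that x has the target function."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, epochs=200, lr=0.1):
    """Plain batch gradient descent for logistic regression."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = predict((w, b), xi) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def train_ensemble(X, y, n_models, rng):
    """Train each member on a bootstrap resample of the labeled set (bagging)."""
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_logistic([X[i] for i in idx], [y[i] for i in idx]))
    return models


def ensemble_score(models, x):
    """Average the member scores into one prediction score."""
    return sum(predict(m, x) for m in models) / len(models)


rng = random.Random(0)
dim = 8
target = make_embeddings(40, dim, 1.0, rng)       # label 1: target function
non_target = make_embeddings(40, dim, -1.0, rng)  # label 0: non-target functions
X = target + non_target
y = [1] * len(target) + [0] * len(non_target)

models = train_ensemble(X, y, n_models=5, rng=rng)

# Hypothetical database entries to screen, given as embedding vectors.
candidates = {"candidate_A": [1.0] * dim, "candidate_B": [-1.0] * dim}
scores = {name: ensemble_score(models, emb) for name, emb in candidates.items()}
hits = [name for name, s in scores.items() if s >= 0.5]  # illustrative cutoff
print(scores, hits)
```

Averaging over several members trained on different resamples is one simple way to obtain a score that can be ranked and thresholded across a database; in practice the cutoff would be chosen on held-out data.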

Authors

  • Wiegand, S.
  • Kaster, A.-K.

Categories