Learning the Unseen: Data-Augmented Deep Learning for PTM Discovery with Prosit-PTM

Journal: bioRxiv
Published Date:

Abstract

Post-translational modifications (PTMs) are critical regulators of protein function, yet confidently identifying and localizing PTM sites across proteomes remains challenging. Integrating peptide property predictions into spectrum interpretation improves identification performance, but training data that enable zero-shot prediction across diverse PTMs are scarce. Here, we present a major expansion of the ProteomeTools dataset, comprising over 977,000 synthetic peptides and covering 22 PTM–residue combinations. We also developed Prosit-PTM, a model that combines chemically informed encoding with amino acid substitution-based augmentation and is trained on this new ground-truth dataset; it achieves accurate zero-shot predictions. Applied to modified peptides, Prosit-PTM enhances PTM-site localization in phosphoproteomics, increases identification of multiply modified peptides in histones, and enables data-driven rescoring for unseen modifications, such as those on HLA peptides. Furthermore, the learned embeddings of amino acids and modifications capture physicochemical relationships underlying PTM-driven HLA presentation. Prosit-PTM is integrated into multiple open-source tools, enabling PTM-aware rescoring, site localization, spectral library generation, and beyond.

Authors

  • Wassim Gabriel; Daniel P. Zolg; Victor Giurcoiu; Omar Shouman; Polina Prokofeva; Florian Seefried; Florian P. Bayer; Ludwig Lautenbacher; Armin Soleymaniniya; Karsten Schnatbaum; Johannes Zerweck; Tobias Knaute; Bernard Delanghe; Andreas Huhmer; Holger Wenschuh; Ulf Reimer; Guillaume Médard; Bernhard Kuster; Mathias Wilhelm