Foundation model for mass spectrometry proteomics
Journal: arXiv
Published Date: May 16, 2025
Abstract
Mass spectrometry is the dominant technology in the field of proteomics,
enabling high-throughput analysis of the protein content of complex biological
samples. Due to the complexity of the instrumentation and resulting data,
sophisticated computational methods are required for the processing and
interpretation of acquired mass spectra. Machine learning has shown great
promise to improve the analysis of mass spectrometry data, with numerous
purpose-built methods for improving specific steps in the data acquisition and
analysis pipeline reaching widespread adoption. Here, we propose unifying
various spectrum prediction tasks under a single foundation model for mass
spectra. To this end, we pre-train a spectrum encoder using de novo sequencing
as a pre-training task. We then show that these pre-trained spectrum
representations improve performance on four downstream tasks: spectrum
quality prediction, chimericity prediction, phosphorylation prediction, and
glycosylation status prediction. Finally, we perform multi-task fine-tuning
and find that this approach improves performance on each individual task.
Overall, our work demonstrates that a foundation model for tandem
mass spectrometry proteomics trained on de novo sequencing learns generalizable
representations of spectra, improves performance on downstream tasks where
training data is limited, and can ultimately enhance data acquisition and
analysis in proteomics experiments.
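
As a rough illustration of the multi-task setup described above, the following sketch shares one spectrum embedding across four task-specific heads and sums their losses. It is not the authors' implementation. The encoder here is a toy mean-pooling stand-in for the pre-trained spectrum encoder. Treating each task as binary classification, the function names (encode, head, multitask_loss), the embedding size, and the example peaks are all assumptions made for illustration.

```python
import math
import random

# Task names come from the abstract; treating each as a binary label is an assumption.
TASKS = ["quality", "chimericity", "phosphorylation", "glycosylation"]


def encode(peaks, weights):
    """Toy stand-in for a pre-trained spectrum encoder (hypothetical).

    peaks: list of (m/z, intensity) pairs.
    weights: list of (w_mz, w_int) pairs, one per embedding dimension.
    Returns a mean-pooled embedding with len(weights) entries.
    """
    pooled = [0.0] * len(weights)
    for mz, intensity in peaks:
        for i, (w_mz, w_int) in enumerate(weights):
            pooled[i] += math.tanh(w_mz * mz / 1000.0 + w_int * intensity)
    return [value / len(peaks) for value in pooled]


def head(embedding, head_weights):
    """Task-specific linear layer followed by a sigmoid, giving a probability."""
    logit = sum(w * x for w, x in zip(head_weights, embedding))
    return 1.0 / (1.0 + math.exp(-logit))


def multitask_loss(peaks, labels, encoder_weights, heads):
    """Sum of per-task binary cross-entropy losses over one shared embedding."""
    embedding = encode(peaks, encoder_weights)
    total = 0.0
    for task in TASKS:
        p = head(embedding, heads[task])
        y = labels[task]
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total


rng = random.Random(0)
dim = 8  # illustrative embedding size
encoder_weights = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(dim)]
heads = {task: [rng.uniform(-1, 1) for _ in range(dim)] for task in TASKS}

spectrum = [(175.119, 0.4), (304.161, 1.0), (451.230, 0.7)]  # made-up peaks
labels = {"quality": 1, "chimericity": 0, "phosphorylation": 0, "glycosylation": 0}
loss = multitask_loss(spectrum, labels, encoder_weights, heads)
```

In a real multi-task fine-tuning run, the gradient of this summed loss would update the shared encoder and all heads together, which is one plausible way the tasks could benefit from each other.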