Limitations of de novo sequencing in resolving sequence ambiguity

Journal: bioRxiv
Published Date:

Abstract

De novo peptide sequencing identifies peptides from fragmentation spectra without relying on sequence databases. However, incomplete spectra leave several candidate sequences consistent with the same data, making unambiguous identification challenging. Recent deep learning advances have produced numerous de novo models that predict sequences and refine peptide–spectrum matches under such conditions. Yet their relative strengths, weaknesses, and ability to handle spectrum ambiguity remain unclear. Here, we benchmark eight state-of-the-art models on three publicly available proteomics datasets, comparing performance using established metrics and quantifying inter-model agreement. We assess post-processing approaches, including iterative refinement, rescoring, and reranking, for their ability to improve identification accuracy, and perform an error analysis to identify common mispredictions and their causes. Model performance varied, but the models' correct identifications overlapped considerably. Post-processing yielded at most modest improvements. Most sequencing errors were model-independent and driven by limited fragment ion coverage, a limitation also observed in database searches with large search spaces.
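To make the notion of limited fragment ion coverage concrete, the following minimal Python sketch counts how many backbone cleavage sites of a peptide are supported by at least one matched b- or y-ion. It is an illustration only, not the paper's analysis code: the function name `fragment_coverage`, the ion-label format (`"b2"`, `"y5"`), and the definition of coverage as the fraction of covered cleavage sites are assumptions made here.

```python
import re


def fragment_coverage(peptide, matched_ions):
    """Return (coverage, missing_sites) for a peptide and its matched ion labels.

    Hypothetical helper for illustration. Cleavage site i (1-based) lies
    between residue i and residue i + 1. A b-ion b_i supports site i; a
    y-ion y_j supports site n - j, where n is the peptide length.
    """
    n = len(peptide)
    if n < 2:
        raise ValueError("a peptide needs at least two residues")

    covered = set()
    for label in matched_ions:
        match = re.fullmatch(r"([by])(\d+)", label)
        if match is None:
            continue  # ignore labels that are not plain b/y ions
        series, index = match.group(1), int(match.group(2))
        if not 1 <= index <= n - 1:
            continue  # ignore ions that cannot belong to this peptide
        covered.add(index if series == "b" else n - index)

    missing = [site for site in range(1, n) if site not in covered]
    return len(covered) / (n - 1), missing


coverage, missing = fragment_coverage("PEPTIDE", ["b2", "y5", "y1"])
print(f"coverage = {coverage:.2f}, unsupported sites = {missing}")
```

In this toy case only sites 2 and 6 are supported. The residues between two supported sites (here positions 3 to 6) are then constrained mainly by their combined mass, so reorderings of that block can fit the spectrum about equally well, which is one way to picture the ambiguity described above.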

Authors

  • Sam van Puyenbroeck; Denis Beslic; Tomi Suomi; Tanja Holstein; Thilo Muth; Laura L. Elo; Lennart Martens; Robbin Bouwmeester; Tim Van Den Bossche; Tine Claeys