Improved workflow for constructing machine learning models: Predicting retention times and peak widths in oligonucleotide separation.

Journal: Journal of chromatography. A
PMID:

Abstract

This study presents an improved workflow to support the development of machine learning models that predict oligonucleotide retention times and peak widths, and thus peak resolutions, from datasets too large for manual processing. We explored diverse oligonucleotide forms, ranging from native to fully phosphorothioated, using three different gradient slopes. Both native and phosphorothioated oligonucleotides were separated using a chromatographic C18 system with tributylaminium ion as the ion-pair reagent in the eluent, resulting in retention time data for approximately 900 sequences per gradient. To manage these large datasets, we developed a semi-automatic rule-based approach for retention time determination, peak decomposition, peak width assessment, signal-to-noise ratio estimation, and skewness analysis. Probability density functions (PDFs) were fitted to elution profiles, with PDF selection based on an F-test. Co-eluting peaks were addressed using a multiple Gaussian PDF. The encoded sequence data were modeled using support vector regression (SVR), gradient boosting (GB), random forest (RF), and decision tree (DT) models. GB and SVR showed promise for retention predictions, while RF and DT were faster but demonstrated limited generalization capabilities. The machine learning models exhibited larger errors for the shallowest gradient and lower predictability for P=O sequences, potentially due to signal intensity and sequence heterogeneity. Ways to improve signal-to-noise ratios were considered, including mass spectrometry in selected ion monitoring mode. The best model for these data sets was GB, closely followed by the SVR model. With established models for retention and peak width, chromatograms can now be predicted for various gradient slopes, offering prediction of impurity peak resolution for arbitrary sequences and gradient slopes.
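
As an illustration of how predicted retention times and peak widths lead to a predicted resolution, the following minimal sketch applies the standard half-height resolution formula, Rs = 1.18 (t2 - t1) / (w1 + w2), which assumes Gaussian peaks. The abstract does not state which resolution definition the study used, so the formula choice, the function names, and the numerical values below are assumptions for illustration only.

```python
import math


def fwhm_from_sigma(sigma):
    """Full width at half maximum of a Gaussian peak with standard deviation sigma."""
    return 2.0 * math.sqrt(2.0 * math.log(2.0)) * sigma


def resolution(t1, w1, t2, w2):
    """Resolution of two peaks from retention times and half-height widths.

    Assumes Gaussian peaks: Rs = 1.18 * |t2 - t1| / (w1 + w2).
    All four arguments must share the same time unit.
    """
    if w1 <= 0 or w2 <= 0:
        raise ValueError("peak widths must be positive")
    return 1.18 * abs(t2 - t1) / (w1 + w2)


# Hypothetical model outputs in minutes (not values from the study):
# a main peak and a closely eluting impurity peak.
rs = resolution(t1=10.0, w1=0.10, t2=10.3, w2=0.12)
print(f"Rs = {rs:.2f}")
```

Repeating the calculation with retention times and widths predicted for each gradient slope would show how the resolution of a given impurity pair is expected to change with the gradient.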

Authors

  • Jörgen Samuelsson
    Department of Engineering and Chemical Sciences, Karlstad University, SE-651 88 Karlstad, Sweden. Electronic address: Jorgen.Samuelsson@kau.se.
  • Martin Enmark
    Department of Engineering and Chemical Sciences, Karlstad University, Karlstad SE-651 88, Sweden. Electronic address: Martin.Enmark@kau.se.
  • Gergely Szabados
    Department of Engineering and Chemical Sciences, Karlstad University, Karlstad SE-651 88, Sweden.
  • Manal Rahal
    Department of Mathematics and Computer Science, Karlstad University, Sweden.
  • Bestoun S Ahmed
    Department of Mathematics and Computer Science, Karlstad University, Sweden.
  • Jakob Häggström
    Department of Engineering and Chemical Sciences, Karlstad University, Karlstad SE-651 88, Sweden.
  • Patrik Forssén
    Department of Engineering and Chemical Sciences, Karlstad University, Karlstad SE-651 88, Sweden.
  • Torgny Fornstedt
    Department of Engineering and Chemical Sciences, Karlstad University, SE-651 88 Karlstad, Sweden. Electronic address: Torgny.Fornstedt@kau.se.