Enhancing identification confidence in non-targeted screening of emerging contaminants via an ensemble retention time prediction model: Applications in screening and ecological risk assessment.

Journal: Environmental pollution (Barking, Essex : 1987)

Abstract

The growing number of emerging contaminants (ECs) poses significant challenges to non-targeted screening (NTS) and annotation. Machine learning-based retention time (RT) prediction models offer a promising way to narrow the list of candidate compounds and enhance identification accuracy. However, existing studies rely on a single machine learning algorithm, which is susceptible to overfitting or underfitting on particular datasets and may yield substantial prediction bias. To address these limitations and improve both predictive performance and generalization capability, we developed an ensemble modeling framework that integrates predictions from multiple base models (i.e., eXtreme Gradient Boosting, Light Gradient Boosting Machine, Random Forest, and Support Vector Regression) through a weighted fusion strategy. Using our self-built database of 362 ECs, we constructed this ensemble framework from molecular descriptors and fingerprints to predict the RT of ECs. The optimized ensemble model significantly outperformed the individual models (R² = 0.96 vs. R² = 0.57-0.87). Further feature optimization reduced training and prediction times by 72.8 % and 96.2 %, respectively. When applied to screening for ECs in sewage and soil samples, the ensemble model enabled high-confidence classification (ΔRT < 1.5 min) of 101 S2-level ECs, reducing the number of EC candidates by 57 %. Bisphenol A (BPA) and tris(2-chloroethyl) phosphate (TCEP), which showed higher confidence and larger relative peak areas, were then selected for quantitative verification. Both were detected in sewage and soil samples (BPA: 175.5 μg/kg; TCEP: 788-3957 ng/L), further verifying the applicability of the model. Ecological risk assessment via the toxicological priority index identified personal care products and pharmaceuticals as the primary high-risk ECs, with fipronil, ensulizole, lidocaine, amantadine, and sulpiride posing the greatest risks. This ensemble framework provided precise RT prediction for NTS, improving EC detection efficiency and supporting risk management.
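
The weighted fusion and ΔRT filtering steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the fusion weights were derived, so the inverse-error weighting below is an assumption, as are all function names and numeric values; only the four base model types and the 1.5 min ΔRT threshold come from the text.

```python
from typing import Dict, List


def inverse_error_weights(val_mae: Dict[str, float]) -> Dict[str, float]:
    """Assumed weighting rule: weight each base model by 1 / validation MAE,
    then normalise so the weights sum to 1. The abstract does not specify
    the actual rule used by the authors."""
    inverse = {name: 1.0 / mae for name, mae in val_mae.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def fuse(predictions: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted fusion: combine the base-model RT predictions (minutes)
    for one compound into a single ensemble prediction."""
    return sum(weights[name] * rt for name, rt in predictions.items())


def filter_candidates(
    predicted_rt: Dict[str, float], observed_rt: float, max_delta_rt: float = 1.5
) -> List[str]:
    """Keep candidates whose predicted RT lies within max_delta_rt minutes
    of the observed RT (the abstract's criterion is ΔRT < 1.5 min)."""
    return [
        name
        for name, rt in predicted_rt.items()
        if abs(rt - observed_rt) < max_delta_rt
    ]


if __name__ == "__main__":
    # Hypothetical validation errors (minutes) for the four base models.
    weights = inverse_error_weights(
        {"xgboost": 0.8, "lightgbm": 0.9, "random_forest": 1.2, "svr": 2.0}
    )
    # Hypothetical base-model predictions for one candidate structure.
    ensemble_rt = fuse(
        {"xgboost": 7.9, "lightgbm": 8.1, "random_forest": 8.4, "svr": 9.0},
        weights,
    )
    # Hypothetical candidate list for a feature observed at 8.0 min.
    kept = filter_candidates(
        {"candidate_A": ensemble_rt, "candidate_B": 12.3}, observed_rt=8.0
    )
    print(weights, ensemble_rt, kept)
```

Under this assumed scheme, base models with lower validation error contribute more to the fused prediction, and candidates whose predicted RT deviates from the observed RT by 1.5 min or more are discarded.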
