Data-driven prediction of daily Cryptosporidium river concentrations for water resource management: Use of catchment-averaged vs spatially distributed features in a Bagging-XGBoost model.
Journal:
The Science of the total environment
Published Date:
Jun 20, 2025
Abstract
Cryptosporidium is a waterborne pathogen that poses a major challenge to water utilities because of its resistance to chlorination and its infectivity at very low concentrations. The ability to predict Cryptosporidium concentrations in rivers would significantly aid abstraction-based risk management of water resources, but current models cannot make predictions at the temporal resolution required to inform abstraction decision-making. This study uses Cryptosporidium data collected over 7 years at a major river abstraction site in South East England, alongside publicly available remote sensing data, to train a Bagging-XGBoost model for predicting Cryptosporidium at daily timescales. Different combinations of catchment-averaged and spatially distributed datasets were trialled as model inputs. The highest-performing models predicted 69-75 % of exceedances of >1 oocysts L⁻¹, and they also predicted the timing of 78-89 % of higher (>2 oocysts L⁻¹) exceedances. Interpretation of predictions using SHapley Additive exPlanations analysis indicated that sources near the intake (<30 km) were the most important, and identified catchment-averaged rainfall at 1- and 2-day lag times and antecedent Cryptosporidium measurements as significant inputs. The study demonstrates the potential of such models when an unparsimonious approach to feature selection is taken, because of their ability to discern non-linear trends and their resistance to multicollinearity and redundancy in the input data. Such models could improve the ability of water utilities to predict Cryptosporidium peaks and aid abstraction decision-making, thereby reducing the loadings of this pathogen to reservoirs and water treatment works.
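Two ideas in the abstract can be illustrated with a minimal sketch: rainfall inputs lagged by 1 and 2 days, and the share of threshold exceedances that a model predicts. The abstract does not define how that share was computed, so the sketch assumes it is the fraction of observed exceedances for which the prediction also exceeds the threshold. The function names `lagged` and `exceedance_hit_rate` are hypothetical, and all numbers are made up for illustration; none come from the study.

```python
from typing import List, Optional, Sequence


def lagged(series: Sequence[float], lag: int) -> List[Optional[float]]:
    """Shift a daily series by `lag` days; the first `lag` days have no value."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    n = len(series)
    return [None] * min(lag, n) + list(series[: max(n - lag, 0)])


def exceedance_hit_rate(
    observed: Sequence[float], predicted: Sequence[float], threshold: float
) -> float:
    """Fraction of observed exceedances (> threshold) that were also predicted.

    Assumed definition: the abstract does not state how its percentages
    were calculated.
    """
    hits = total = 0
    for obs, pred in zip(observed, predicted):
        if obs > threshold:
            total += 1
            if pred > threshold:
                hits += 1
    return hits / total if total else float("nan")


# Made-up daily catchment-averaged rainfall (mm), not data from the study.
rainfall_mm = [0.0, 12.0, 3.0, 0.0, 8.0]
rain_lag1 = lagged(rainfall_mm, 1)  # rainfall one day earlier
rain_lag2 = lagged(rainfall_mm, 2)  # rainfall two days earlier

# Made-up observed and predicted concentrations (oocysts per litre).
observed = [0.2, 1.5, 2.4, 0.1, 1.2]
predicted = [0.3, 1.1, 0.8, 0.2, 1.6]
rate = exceedance_hit_rate(observed, predicted, threshold=1.0)
```

In a full workflow the lagged columns would be passed, together with the other features, to the ensemble model; that step is omitted here because the abstract gives no configuration details.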