Improving fecal bacteria estimation using machine learning and explainable AI in four major rivers, South Korea.
Journal:
The Science of the total environment
PMID:
39536862
Abstract
This study addresses the critical public health issue of fecal coliform contamination in the four major rivers of South Korea (the Han, Nakdong, Geum, and Yeongsan rivers) by applying advanced machine learning (ML) algorithms combined with Explainable Artificial Intelligence to enhance both prediction accuracy and interpretability. Both traditional and ML models often struggle to estimate fecal coliform levels accurately because of the complexity of environmental variables and data limitations. To address these limitations, we employed two tree-based models (random forest [RF] and extreme gradient boosting [XGBoost]) and two neural network models (deep neural network and convolutional neural network [CNN]). We also used Shapley Additive Explanations (SHAP) to clarify how each variable influences the models' predictions. Based on a comprehensive dataset collected from the National Institute of Environmental Research, covering 16 water quality parameters and meteorological data from 2014 to 2022, our study improved the accuracy of fecal coliform estimation using the XGBoost and CNN models. The best result was obtained with XGBoost, which had a validation Nash-Sutcliffe efficiency of 0.597 in the Han River. In addition, this study used SHAP to provide insights into the significant factors influencing fecal coliform concentrations across different river environments. The results indicated that the XGBoost model provided superior estimation accuracy along with explanations of the contributions of individual variables. The SHAP results quantified the contribution of each water quality variable to the fecal coliform estimates produced by the XGBoost model. The study facilitates an improved understanding of the relationship between water quality variables and fecal coliform contamination mechanisms in the four major rivers of South Korea.
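The abstract reports model skill as a validation Nash-Sutcliffe efficiency (NSE). As a reading aid, the following is a minimal sketch of how NSE is commonly computed: one minus the ratio of the squared estimation error to the variance of the observations around their mean. The function name and the sample values are hypothetical and are not taken from the study; the authors' actual evaluation code is not described in the abstract.

```python
def nash_sutcliffe_efficiency(observed, simulated):
    """Return the Nash-Sutcliffe efficiency of simulated vs. observed values.

    NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)
    A value of 1 is a perfect match; 0 means the model is no better
    than always predicting the mean of the observations.
    """
    observed = [float(v) for v in observed]
    simulated = [float(v) for v in simulated]
    if len(observed) != len(simulated):
        raise ValueError("observed and simulated must have the same length")
    if not observed:
        raise ValueError("at least one observation is required")

    mean_observed = sum(observed) / len(observed)
    # Squared error of the model estimates.
    residual_ss = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    # Squared deviation of the observations from their own mean.
    total_ss = sum((o - mean_observed) ** 2 for o in observed)
    if total_ss == 0.0:
        raise ValueError("NSE is undefined when all observations are equal")
    return 1.0 - residual_ss / total_ss


# Hypothetical values for illustration only (not data from the study).
example_observed = [1.0, 2.0, 3.0, 4.0]
example_simulated = [1.0, 2.0, 3.0, 5.0]
example_nse = nash_sutcliffe_efficiency(example_observed, example_simulated)
```

Under this definition, the reported value of 0.597 would mean that the XGBoost estimates reduce the squared error by roughly 60% relative to always predicting the mean observed concentration. Whether the study computed NSE on raw or transformed concentrations is not stated in the abstract.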