Comparing the performance of 10 machine learning models in predicting Chlorophyll a in western Lake Erie.

Journal: Journal of environmental management
PMID:

Abstract

Algal blooms, which have substantial adverse effects, are occurring with increasing frequency worldwide under global warming and eutrophication. Machine learning models (MLMs) are emerging as efficient and promising tools for predicting algal blooms. However, the performance of MLMs in directly simulating algal blooms has seldom been reported, particularly in the world's largest freshwater system, the Great Lakes. To address this gap, we compared the performance of 10 popular MLMs in predicting Chlorophyll a (Chl a, a proxy for algal biomass) concentration in western Lake Erie, using 15 measured water quality parameters collected from 2012 to 2022. The results show that outlier removal is essential: it noticeably improves prediction accuracy, for example increasing the coefficient of determination (R²) from 0.35 to 0.84 (140 %) for the optimal Gradient Boosting Decision Trees (GBDT) model. All 32,767 feature combinations of the measured water quality parameters were exhaustively tested for each MLM, and the best feature combinations were identified. MLMs benefit from this feature selection, with the Polynomial Regression model showing a notable improvement: its R² increased from 0.71 to 0.82 (15 %) compared with no feature selection. The tree-based ensemble models, GBDT (R² = 0.84) and Random Forest (R² = 0.82), were the two best performers in predicting Chl a. Based on feature importance analysis, particulate organic nitrogen (PON) was the most critical water quality parameter for predicting Chl a. These results establish a benchmark for the performance of common MLMs in predicting Chl a in western Lake Erie. The best feature combinations identified here could make water quality observations more effective and targeted, thereby benefiting sustainable water quality management.
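
The figures quoted in the abstract can be checked with a short sketch. It is illustrative only and is not the authors' code: the parameter names are placeholders (the abstract names only PON), and the helpers `nonempty_subsets` and `relative_gain` are hypothetical. It enumerates every non-empty combination of 15 parameters, which gives 2^15 - 1 = 32,767, and recomputes the two reported relative improvements.

```python
from itertools import combinations


def nonempty_subsets(features):
    """Yield every non-empty combination of the given features."""
    for k in range(1, len(features) + 1):
        yield from combinations(features, k)


def relative_gain(before, after):
    """Relative improvement of a score, expressed as a percentage."""
    return (after - before) / before * 100.0


# Placeholder names: the abstract does not list the 15 parameters.
features = [f"param_{i}" for i in range(1, 16)]

# Count the candidate feature sets an exhaustive search would visit.
n_subsets = sum(1 for _ in nonempty_subsets(features))
print(n_subsets)  # 32767, i.e. 2**15 - 1

# Relative gains for the values reported in the abstract.
gbdt_gain = relative_gain(0.35, 0.84)  # outlier removal, GBDT
poly_gain = relative_gain(0.71, 0.82)  # feature selection, Polynomial Regression
print(round(gbdt_gain), round(poly_gain))  # 140 15
```

In an actual exhaustive search, each yielded subset would be used to fit and score a model. That is 32,767 fits per model type before any cross-validation, which is feasible at 15 parameters but doubles with each parameter added.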

Authors

  • Yang Song
    Biomedical and Multimedia Information Technology (BMIT) Research Group, School of IT, University of Sydney, NSW 2006, Australia. Electronic address: yson1723@uni.sydney.edu.au.
  • Chunqi Shen
    Yale School of Environment, Yale University, New Haven, CT, 06511, United States.
  • Yi Hong
    Department of Product Engineering, MedeAnalytics, Inc., Emeryville, CA, USA.