A novel hybrid machine learning approach for accurate retrieval of ocean surface chlorophyll-a across oligotrophic to eutrophic waters.

Journal: Environmental Research
Published Date:

Abstract

Accurate assessment of the distribution and variation of chlorophyll a (Chla) concentration is important for environmental monitoring and ecological research. However, Chla inversion across optically different water types currently requires a separate algorithm for each type, and a unified machine learning framework is lacking. This study therefore focuses on two aspects, input features and data samples, and designs an innovative composite machine learning framework called the Synth Ridge Framework (SRF). The framework consists of two main components: feature expansion and model construction. We employed the band ratio method and BorutaShap for feature expansion and selection. By integrating three gradient boosting decision tree (GBDT) models (XGBoost, LightBoost, and CatBoost) with the MDN ensemble strategy, we constructed a model named SynthRidge, aiming to enhance overall performance. SynthRidge was trained and validated on the Rrs-in situ Chla dataset from the Terra-MODIS sensor, with Chla values ranging from 0 to 50 mg/m³ in both datasets. On the validation dataset, SynthRidge achieved strong predictive performance, with an R² of 0.930, a slope of 0.928, an RMSE of 4.672 mg/m³, an RMLSE of 0.039, a bias of 1.023, and an MAE of 1.389. Compared with the best-performing baseline model, the GBDT ensemble, SynthRidge demonstrated superior accuracy and robustness: it improved the R² by 0.006, increased the slope by 0.020, reduced the RMSE by 0.890 mg/m³, and decreased the RMLSE by 0.003. The Chla density distribution predicted by SynthRidge was also more consistent with the measured values.
These findings suggest that SRF can effectively compensate for the limitations of input features, reduce the negative impact of data distribution, and mitigate the limitations of complex fusion algorithms. Furthermore, the performance of SRF on the SeaWiFS dataset demonstrates its versatility across different sensors.
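
The band ratio feature expansion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which bands or ratios the authors used, so the function name `expand_band_ratios`, the band labels, and the reflectance values below are all hypothetical. The sketch appends every pairwise ratio to the raw Rrs bands and is not the authors' implementation. The subsequent BorutaShap selection step is not shown.

```python
from itertools import combinations

import numpy as np


def expand_band_ratios(rrs, band_names, eps=1e-12):
    """Append every pairwise ratio Rrs(a)/Rrs(b) to the raw band columns.

    Assumption: all unordered band pairs are used; the source does not
    specify which ratios were actually constructed.
    """
    rrs = np.asarray(rrs, dtype=float)
    features = [rrs[:, i] for i in range(rrs.shape[1])]
    names = list(band_names)
    for i, j in combinations(range(rrs.shape[1]), 2):
        # Mark near-zero denominators as NaN instead of dividing by zero.
        denom = np.where(np.abs(rrs[:, j]) < eps, np.nan, rrs[:, j])
        features.append(rrs[:, i] / denom)
        names.append(f"{band_names[i]}/{band_names[j]}")
    return np.column_stack(features), names


# Hypothetical band labels and made-up reflectance values, for illustration only.
bands = ["Rrs443", "Rrs488", "Rrs547"]
sample = np.array([[0.004, 0.005, 0.002],
                   [0.002, 0.003, 0.004]])
X, feature_names = expand_band_ratios(sample, bands)
```

Here three input bands yield six feature columns: the three raw bands followed by their three pairwise ratios. A feature selection step such as BorutaShap would then decide which of these candidates to keep.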

Authors

  • Ting Qin
    College of Geomatics and Geoinformation, Guilin University of Technology, Guilin, 541006, China; Ecological Spatiotemporal Big Data Perception Service Laboratory, Guilin University of Technology, Guilin, 541006, China.
  • Tianlong Liang
    NetCraft Information Technology (Macau) Co., Ltd., Macau, 999078, China.
  • Donglin Fan
    College of Geomatics and Geoinformation, Guilin University of Technology, Guilin, 541006, China; Ecological Spatiotemporal Big Data Perception Service Laboratory, Guilin University of Technology, Guilin, 541006, China. Electronic address: dlfan@glut.edu.cn.
  • Hongchang He
    College of Geomatics and Geoinformation, Guilin University of Technology, Guilin, 541006, China.
  • Guiwen Lan
    College of Geomatics and Geoinformation, Guilin University of Technology, Guilin, 541006, China; Ecological Spatiotemporal Big Data Perception Service Laboratory, Guilin University of Technology, Guilin, 541006, China.
  • Bolin Fu
    College of Geomatics and Geoinformation, Guilin University of Technology, Guilin, 541006, China; Ecological Spatiotemporal Big Data Perception Service Laboratory, Guilin University of Technology, Guilin, 541006, China.