Dissolved organic carbon estimation in lakes: Improving machine learning with data augmentation on fusion of multi-sensor remote sensing observations.

Journal: Water research
PMID:

Abstract

This paper presents a novel approach for estimating Dissolved Organic Carbon (DOC) concentrations in lakes, accounting for both carbon sources and sinks. Despite the critical role of DOC, the combined application of machine learning, as a robust predictor, and remote sensing, which reduces the need for costly and time-intensive in-situ sampling, has been underexplored in DOC research. Focusing on lakes in the states of New York, Vermont and Maine (United States), this study integrates in-situ DOC measurements with surface reflectance bands obtained from Landsat satellites between 2000 and 2020. Using these bands as inputs to a Random Forest (RF) predictive model, the methodology explores the ability of remote sensing data to support large-scale DOC simulation. Initial results showed low accuracy and significant underestimation due to the imbalanced distribution of DOC samples. Statistical analysis showed a DOC concentration of 5.37±3.37 mg/L (mean±one standard deviation), with peaks of up to 25 mg/L. When the distribution of a chemical constituent is highly skewed towards the lower ranges, a model can misrepresent extreme and hazardous events, because these occur far less often and are swamped by the much more frequent ordinary events. To address this issue, the Synthetic Minority Over-sampling Technique (SMOTE) was applied as a key innovation, generating synthetic samples that enhance RF accuracy and reduce the associated errors. Fusion of in-situ and remote sensing data, combined with machine learning and data augmentation, significantly enhances DOC estimation accuracy, especially in the high concentration ranges that are critical for environmental health. With prediction metrics of RMSE = 1.75, MAE = 1.09, and R = 0.74, RF-SMOTE significantly improves on the metrics obtained from the stand-alone RF, particularly in estimating high DOC concentrations.
Given the product's spatial resolution of 30 m, the model's output offers a potential avenue for global application in lake monitoring, even in remote regions where direct sampling is limited. This novel fusion of remote sensing, machine learning and data augmentation offers valuable insights for water quality management and understanding carbon cycling in aquatic ecosystems.
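
The augmentation idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say how SMOTE was adapted to a continuous target such as DOC, so everything below is an assumption made for illustration: samples at or above a hypothetical threshold are treated as the rare group, and each synthetic sample is interpolated between a rare sample and one of its nearest rare neighbours, with the DOC value interpolated by the same weight. The function name, the threshold, the value of k, and the reflectance and DOC numbers are all made up and are not taken from the study.

```python
import math
import random


def smote_like_oversample(features, targets, threshold, n_new, k=3, seed=0):
    """Create n_new synthetic (features, target) pairs from rare samples.

    Rare samples are those with target >= threshold (an assumed rule).
    Each synthetic sample lies on the segment between a rare sample and
    one of its k nearest rare neighbours; the target uses the same weight.
    """
    rng = random.Random(seed)
    rare = [(x, y) for x, y in zip(features, targets) if y >= threshold]
    if len(rare) < 2:
        raise ValueError("need at least two rare samples to interpolate")

    synthetic_x, synthetic_y = [], []
    for _ in range(n_new):
        i = rng.randrange(len(rare))
        base_x, base_y = rare[i]
        # Rank the other rare samples by Euclidean distance to the base.
        others = [rare[j] for j in range(len(rare)) if j != i]
        others.sort(key=lambda s: math.dist(base_x, s[0]))
        nb_x, nb_y = rng.choice(others[:k])
        w = rng.random()  # interpolation weight in [0, 1)
        synthetic_x.append([a + w * (b - a) for a, b in zip(base_x, nb_x)])
        synthetic_y.append(base_y + w * (nb_y - base_y))
    return synthetic_x, synthetic_y


# Made-up reflectance vectors (three bands) and DOC values in mg/L.
X = [[0.02, 0.03, 0.01], [0.03, 0.04, 0.02], [0.02, 0.05, 0.02],
     [0.08, 0.10, 0.06], [0.09, 0.12, 0.07], [0.10, 0.11, 0.08]]
y = [3.1, 4.0, 5.2, 14.8, 18.5, 22.0]

# Hypothetical threshold of 10 mg/L separating the rare high-DOC samples.
new_X, new_y = smote_like_oversample(X, y, threshold=10.0, n_new=4)
augmented_X, augmented_y = X + new_X, y + new_y
```

In a workflow like the one the abstract outlines, the augmented set would then be used to train a regressor such as scikit-learn's RandomForestRegressor. Only the training split should be augmented, so that the reported metrics are computed on observed samples.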

Authors

  • Seyed Babak Haji Seyed Asadollah
    Department of Civil Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran.
  • Ahmadreza Safaeinia
    Department of Environmental Resources Engineering, State University of New York, College of Environmental Science and Forestry, 1 Forestry Drive, Syracuse, NY 13210, USA. Electronic address: asafaeinia@esf.edu.
  • Sina Jarahizadeh
    Department of Environmental Resources Engineering, State University of New York, College of Environmental Science and Forestry, 1 Forestry Drive, Syracuse, NY 13210, USA. Electronic address: Sjarahizadeh@esf.edu.
  • Francisco Javier Alcalá
    Departamento de Desertificación y Geo-Ecología, Estación Experimental de Zonas Áridas (EEZA-CSIC), 04120 Almería, Spain; Instituto de Ciencias Químicas Aplicadas, Facultad de Ingeniería, Universidad Autónoma de Chile, Santiago 7500138, Chile. Electronic address: fjalcala@eeza.csic.es.
  • Ahmad Sharafati
    Department of Civil Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran. asharafati@gmail.com.
  • Antonio Jodar-Abellan
    Soil and Water Conservation Research Group, Centre for Applied Soil Science and Biology of the Segura, Spanish National Research Council (CEBAS-CSIC), Campus de Espinardo 30100, P.O. Box 164, Murcia, Spain. ajodar@cebas.csic.es.