Application of machine learning (individual vs stacking) models on MERRA-2 data to predict surface PM2.5 concentrations over India.
Journal: Chemosphere
PMID: 37634588
Abstract
The spatial coverage of PM2.5 monitoring is non-uniform across India because of the limited number of ground monitoring stations. As an alternative, the Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2), an atmospheric reanalysis dataset, can be used to estimate PM2.5. MERRA-2 does not explicitly measure PM2.5; it derives it from an empirical model. MERRA-2 data were spatiotemporally collocated with ground observations across India for validation. MERRA-2 significantly underestimated PM2.5 at many monitoring stations, with deviations ranging from -20 to 60 μg m⁻³. The utility of machine learning (ML) models for overcoming this limitation was assessed. MERRA-2 aerosol and meteorological parameters were the input features used to train and test individual ML models and to compare them with the stacking technique. Initially, individual model performance was assessed on a randomly selected 10% of the data to identify the best model. XGBoost (XGB) was the best model (r = 0.73) compared with Random Forest (RF) and LightGBM (LGBM). Stacking was then applied with XGB as the meta-regressor. The stacked model (r = 0.77) outperformed the best standalone model, XGB. The stacking technique was used to predict hourly and daily PM2.5 for different regions across India and for each monitoring station. The eastern region showed the best hourly prediction (r = 0.80) and a substantial reduction in mean bias (MB = -0.03 μg m⁻³), followed by the northern region (r = 0.63 and MB = -0.10 μg m⁻³), where the better performance was attributed to frequent observations of PM2.5 > 100 μg m⁻³. The central region had the lowest performance (r = 0.46 and MB = -0.60 μg m⁻³) because little data was available to train the ML models. Overall, the ML stacking technique predicted PM2.5 over India better on an hourly basis than on a daily basis.
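The abstract reports two evaluation statistics, r and MB, without giving their formulas. The following is a minimal sketch of how such statistics are commonly computed on collocated observation and prediction pairs. It assumes that r is the Pearson correlation coefficient and that MB is the mean of (prediction minus observation), so a negative MB indicates underestimation; the abstract does not confirm either definition. The function names and the sample values are hypothetical and are not taken from the study.

```python
from math import sqrt
from statistics import mean


def pearson_r(obs, pred):
    """Pearson correlation between observed and predicted values (assumed meaning of r)."""
    mean_obs, mean_pred = mean(obs), mean(pred)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_pred = sum((p - mean_pred) ** 2 for p in pred)
    return cov / sqrt(var_obs * var_pred)


def mean_bias(obs, pred):
    """Mean bias as mean(prediction - observation) (assumed definition of MB)."""
    return mean(p - o for o, p in zip(obs, pred))


# Hypothetical collocated pairs in ug m^-3; illustrative only, not data from the paper.
observed = [40.0, 80.0, 120.0, 60.0]
predicted = [30.0, 70.0, 110.0, 50.0]

r = pearson_r(observed, predicted)
mb = mean_bias(observed, predicted)
print(f"r = {r:.2f}, MB = {mb:.2f}")
```

In this toy case every prediction is lower than its observation by the same amount, so the correlation is perfect while the mean bias is negative. This shows why the two statistics are reported together: a high r alone does not rule out systematic underestimation.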