Exploring multivariate machine learning frameworks to parallelize PM2.5 simultaneous estimations across the continental United States.
Journal:
Environmental pollution (Barking, Essex : 1987)
PMID:
40204145
Abstract
Fine particulate matter (PM2.5) comprises diverse chemical components, including elemental carbon (EC), silicon (SI), sulfate (SO4), and calcium (CA), each linked to varied health and environmental impacts. Accurately estimating the spatial and temporal distributions of these components is crucial for regulatory policy and public health. This study developed and evaluated multivariate machine learning models, including Random Forest (RF) and XGBoost (XGB), to estimate daily concentrations of EC, SI, SO4, and CA across the contiguous United States from 2000 to 2019. Unlike traditional univariate approaches, which fit a separate model for each component, multivariate models estimate all components jointly and can capture the interdependencies among them, improving accuracy and efficiency. Using data from 534 monitoring sites and 187 predictor variables derived from satellite observations, reanalysis datasets, and geographical sources, we implemented univariate RF and XGB models and their multivariate counterparts (MRF and MXGBoost). Performance was assessed using R-squared (R²) metrics, and feature importance was evaluated with SHAP values. MXGBoost outperformed the other models, achieving R² values of 70.2% for EC, 79.23% for SO4, 61.57% for SI, and 59.5% for CA, with spatial R² exceeding 93% and temporal R² as high as 82.23% for SO4. Key predictors included wind speed, relative humidity, and aerosol optical depth. The findings highlight the advantages of multivariate modeling in capturing the interdependencies among PM2.5 components, resulting in improved estimation accuracy and computational efficiency. This approach offers valuable applications in air quality management and public health, and it underscores the need to refine multivariate frameworks and explore their applicability to other pollutants.
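The abstract reports overall, spatial, and temporal R² without defining them. The sketch below illustrates one convention that is common in air-pollution modeling, and it is an assumption here, not necessarily the authors' procedure: spatial R² compares site-averaged observations with site-averaged predictions, and temporal R² compares each day's deviation from its own site mean. The function names and the toy records are hypothetical and are not data from the study.

```python
from collections import defaultdict
from statistics import mean


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    obs_mean = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def spatial_temporal_r2(records):
    """Return (overall, spatial, temporal) R-squared.

    records is a list of (site_id, observed, predicted) daily values.
    Assumed convention: spatial uses per-site means, temporal uses
    deviations of each daily value from its own site mean.
    """
    by_site = defaultdict(list)
    for site, obs, pred in records:
        by_site[site].append((obs, pred))

    # Per-site means of observations and of predictions.
    site_obs_mean = {s: mean(o for o, _ in v) for s, v in by_site.items()}
    site_pred_mean = {s: mean(p for _, p in v) for s, v in by_site.items()}

    overall = r_squared(
        [o for _, o, _ in records],
        [p for _, _, p in records],
    )
    sites = sorted(by_site)
    spatial = r_squared(
        [site_obs_mean[s] for s in sites],
        [site_pred_mean[s] for s in sites],
    )
    temporal = r_squared(
        [o - site_obs_mean[s] for s, o, _ in records],
        [p - site_pred_mean[s] for s, _, p in records],
    )
    return overall, spatial, temporal


# Toy records for two hypothetical sites, three days each (not study data).
records = [
    ("A", 1.0, 1.5), ("A", 2.0, 2.0), ("A", 3.0, 2.5),
    ("B", 5.0, 5.0), ("B", 6.0, 6.5), ("B", 7.0, 6.0),
]
overall, spatial, temporal = spatial_temporal_r2(records)
print(f"overall={overall:.4f} spatial={spatial:.4f} temporal={temporal:.4f}")
```

In this toy case the between-site contrast is reproduced much better than the day-to-day variation within each site, so spatial R² comes out higher than temporal R². The abstract reports the same ordering, with spatial R² above 93% and temporal R² of at most 82.23%.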