Text mining and machine learning based health risk prediction for soil polycyclic aromatic hydrocarbons at typical coal-fired industrial sites in China.

Journal: Journal of hazardous materials
Published Date:

Abstract

Polycyclic aromatic hydrocarbons (PAHs) are persistent organic pollutants that pose significant threats to the environment and human health, particularly at active coal-fired industrial sites (coking, steel smelting, and thermal electric power generation) in China. However, for active industrial enterprises, effectively predicting PAH exposure risks at the national scale remains challenging because the spatial distribution of contamination is uncertain and large-scale monitoring data are difficult to acquire. This study introduces a multidisciplinary framework integrating text mining, probabilistic risk assessment, and machine learning to predict and assess the health risks of PAHs in soils at these active industrial sites. Text mining extracted comprehensive PAH-related data from the literature, forming a national database with over 1600 entries. Probabilistic risk assessment with Monte Carlo simulation (1000 iterations per sample) revealed that 32% of historically reported sites in the text-mined database exceeded the acceptable target risk (ATR = 1 × 10⁻⁶), with benzo[a]pyrene (BaP) and dibenzo[a,h]anthracene (DahA) emerging as the primary risk drivers, while non-carcinogenic risks were mostly below safety thresholds (hazard quotient < 1). Four machine learning algorithms, Support Vector Machine (SVM), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), were evaluated for risk classification, with LightGBM achieving the best performance (accuracy: 98.88%, ROC-AUC: 1.0000). Feature importance analysis identified enterprise insured population, regional atmospheric CO concentration, and particulate emission limits as key predictors. National-scale prediction for 1263 sites identified 15.30% exceeding the ATR, predominantly concentrated in eastern and central China. Industry-specific analysis showed that coking plants (39.40%) and steel smelting facilities (38.00%) exhibited higher proportions of sites exceeding the ATR than thermal power plants (22.60%), reflecting process-specific PAH generation patterns. This framework provides an efficient, scalable approach for PAH risk assessment and management at coal-fired sites in China, offering a replicable model for similar environmental challenges globally.
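The probabilistic step described above (1000 Monte Carlo iterations per sample, compared against ATR = 1 × 10⁻⁶) can be illustrated with a minimal sketch. The abstract does not state the exposure model or parameter distributions used in the study, so everything below is an assumption: the sketch uses the generic soil-ingestion form of incremental lifetime cancer risk (chronic daily intake multiplied by a slope factor), and the function names, distributions, and numeric parameter values are hypothetical placeholders, not the authors' settings.

```python
import math
import random

ATR = 1e-6  # acceptable target risk, as stated in the abstract


def ingestion_ilcr(conc_mg_kg, ingestion_mg_day, exposure_days_yr,
                   exposure_years, body_weight_kg, slope_factor,
                   averaging_days):
    """Cancer risk from soil ingestion: chronic daily intake x slope factor.

    The factor 1e-6 converts the ingested soil mass from mg to kg.
    """
    intake = (conc_mg_kg * ingestion_mg_day * 1e-6
              * exposure_days_yr * exposure_years)
    cdi = intake / (body_weight_kg * averaging_days)  # mg/(kg*day)
    return cdi * slope_factor


def exceedance_probability(bap_eq_mg_kg, n_iter=1000, seed=0):
    """Fraction of Monte Carlo draws whose risk exceeds the ATR.

    All distributions and values below are illustrative assumptions,
    not the parameters used in the study.
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_iter):
        risk = ingestion_ilcr(
            conc_mg_kg=bap_eq_mg_kg,
            ingestion_mg_day=rng.lognormvariate(math.log(100.0), 0.3),
            exposure_days_yr=rng.uniform(200.0, 300.0),
            exposure_years=rng.uniform(20.0, 30.0),
            body_weight_kg=rng.uniform(50.0, 80.0),
            slope_factor=7.3,            # placeholder, (mg/(kg*day))^-1
            averaging_days=70 * 365,     # lifetime averaging time
        )
        if risk > ATR:
            exceed += 1
    return exceed / n_iter


if __name__ == "__main__":
    # Hypothetical BaP-equivalent soil concentrations in mg/kg.
    for conc in (0.01, 0.3, 1.0):
        print(conc, exceedance_probability(conc))
```

Under these assumptions, a site would be flagged when its simulated risk distribution lies largely above the ATR; a binary label derived this way is the kind of target a classifier such as LightGBM could then be trained to predict from site-level features.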
