Predicting car accident severity in Northwest Ethiopia: a machine learning approach leveraging driver, environmental, and road conditions.
Journal:
Scientific reports
Published Date:
Jul 1, 2025
Abstract
Road traffic accidents (RTAs) in Northwest Ethiopia, a region with a fatality rate of 32.2 per 100,000 residents, pose a critical public health challenge exacerbated by infrastructural deficits and environmental hazards. This study leverages machine learning (ML) to predict accident severity, addressing gaps in localized predictive frameworks for low- and middle-income countries (LMICs). Our study aims to predict the severity of car accidents in Northwest Ethiopia via machine-learning techniques. Using a dataset of 2,000 accidents (2018-2023) from police reports, we integrated driver demographics, behavioral factors (e.g., alcohol use, seatbelt compliance), and environmental conditions (e.g., unpaved roads, weather) in North West Ethiopia. Ten ML models, including Random Forest, XGBoost, and LightGBM, were evaluated after addressing class imbalance via the Synthetic Minority Oversampling Technique (SMOTE). Hyperparameter tuning and Shapley Additive explanations (SHAP) provided model optimization and interpretability. Random Forest outperformed other models, achieving 82% accuracy (AUC-ROC: 0.87) post-tuning. Driver age (mean: 44 years) and environmental factors (e.g., nighttime on unlit roads, rainy conditions) were critical predictors, increasing fatal accident likelihood by 62%. SMOTE improved the accuracy of the outperforming random forest accuracy from 78.6 to 82%. Random Forest exhibited the highest recall (0.82) after optimization, while ensemble methods dominated performance metrics. The study underscores the efficacy of ML in contextualizing accident severity in LMICs, with Random Forest emerging as a robust tool for policymakers. Prioritizing road paving, sobriety checkpoints, and motorcycle safety could mitigate risks, aligning with Sustainable Development Goal 3.6. Future work should address data limitations (underreporting, geospatial gaps) and expand model interpretability.