A study on the effectiveness of machine learning models for hepatitis prediction.
Journal:
Scientific Reports
Published Date:
Aug 20, 2025
Abstract
Hepatitis continues to be a major global health challenge, leading to high morbidity and mortality rates. Despite advances in diagnosis and treatment, early prediction of hepatitis outcomes remains an essential area for improvement. This study seeks to address this gap by applying a range of machine learning (ML) algorithms to predict hepatitis outcomes, contributing to global efforts to enhance public health. The study used the hepatitis dataset from the UCI repository, which includes 155 participants and 20 attributes covering demographics, clinical data, and laboratory results. Given the limited sample size, we adopted a diverse set of ML techniques to mitigate the risk of overfitting and improve generalizability. Feature selection was performed using the Boruta algorithm. We employed one traditional predictive model, logistic regression (LR), alongside six ML models: support vector machine (SVM), K-nearest neighbors (KNN), artificial neural network (ANN), random forest (RF), AdaBoost, and XGBoost. Model performance was evaluated using accuracy, sensitivity, specificity, precision, and F1 score. The analysis revealed that 89.7% of participants were male, and 83.9% reported fatigue as the primary symptom. The Boruta algorithm identified Ascites, Varices, Bilirubin, Age, Spiders, and Alkaline Phosphatase as key predictors of hepatitis survival outcomes. Among the classification models evaluated, RF achieved the highest overall performance with 92.42% accuracy (95% CI 88.25-96.59), 96.77% precision (CI 93.99-99.55), 95.24% sensitivity (CI 91.89-98.59), and 96.00% F1 score (CI 92.91-99.09), despite a low specificity of 33.33% (CI 25.91-40.75). LR also performed well, with 85.00% accuracy (CI 79.38-90.62), 94.03% precision (CI 90.30-97.76), 88.73% sensitivity (CI 83.75-93.71), and 91.30% F1 score (CI 86.86-95.74), though its specificity was moderate at 55.56% (CI 47.74-63.38). SVM showed strong sensitivity (89.71%) and F1 score (90.37%) with moderate accuracy (83.75%) but low specificity (50.00%). The remaining models (KNN, ANN, AdaBoost, and XGBoost) showed varying balances of performance, with AdaBoost having the highest specificity (95.65%) but the lowest sensitivity (50.00%). Overall, RF combined with Boruta feature selection was the most effective classifier for predicting hepatitis outcomes, excelling in accuracy, precision, and sensitivity, although its specificity was the lowest reported. Applying ML methods to predict survival outcomes in hepatitis could improve healthcare delivery and reduce the impact of hepatitis and other communicable diseases, supporting the achievement of Sustainable Development Goal 3.3, which focuses on eradicating epidemics.
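The five evaluation metrics named in the abstract all derive from the four cells of a binary confusion matrix. The sketch below illustrates those formulas together with a normal-approximation (Wald) 95% confidence interval for a proportion. The function names `classification_metrics` and `wald_ci` are hypothetical, and the counts are illustrative only: the abstract does not report a confusion matrix, and these values were chosen simply because they reproduce the RF percentages quoted above. Likewise, the abstract does not state how the intervals were computed; a Wald interval with n = 155 happens to match the reported accuracy interval, which is an inference rather than the authors' documented method.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute the five metrics named in the abstract from confusion-matrix counts.

    Assumes every denominator is non-zero (no guard against empty classes).
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),      # recall / true positive rate
        "specificity": tn / (tn + fp),      # true negative rate
        "precision": tp / (tp + fp),        # positive predictive value
        "f1": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of precision and sensitivity
    }


def wald_ci(p, n, z=1.96):
    """Normal-approximation confidence interval for a proportion p observed on n cases."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts (not taken from the paper) that reproduce the reported RF figures.
metrics = classification_metrics(tp=60, fp=2, tn=1, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")

# Assumption: interval computed on the full sample of 155 participants.
low, high = wald_ci(0.9242, 155)
print(f"accuracy 95% CI: {low * 100:.2f}-{high * 100:.2f}")
```

With these illustrative counts, only three cases fall in the negative class, which shows how a specificity of 33.33% can coexist with accuracy above 92% when the classes are heavily imbalanced.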