Gender-based data bias and model fairness evaluation in benchmarked open-access disease prediction datasets.

Journal: Computers in biology and medicine

Abstract

The widespread use of open-access datasets for validating machine learning (ML) models has raised critical concerns about data bias and model fairness, particularly in relation to gender. This study systematically investigates gender-based data bias in disease prediction datasets and evaluates the fairness of ML algorithms trained on them. A total of 74 datasets were selected from Kaggle and the UCI Machine Learning Repository, each including gender as a feature and providing classification labels. Data bias was quantified using Earth Mover's Distance to measure disparities in class-wise gender distributions, with statistical significance assessed via bootstrapping. Fairness was evaluated across seven ML algorithms (Decision Tree, Random Forest, Logistic Regression, Artificial Neural Networks, Support Vector Machine, K-Nearest Neighbours, and Naïve Bayes) using k-fold cross-validation and statistical tests. Two fairness definitions, Equalised Odds and Treatment Equality, were applied. Results showed that 35 of the 74 datasets exhibited gender-based data bias, disproportionately affecting females. Heart disease datasets had the highest prevalence of data bias, while the lung cancer and mental health datasets were found to be bias-free. Fairness outcomes varied significantly across algorithms, with Decision Tree showing the fewest fairness issues and Logistic Regression the most. Bias-free datasets consistently produced fewer fairness concerns than biased ones, with statistically significant differences (p < 0.01) across all algorithm groups. These findings highlight the importance of addressing gender-based data bias and selecting appropriate algorithms to improve fairness in ML applications. The study contributes to the development of equitable AI systems, thereby supporting data-driven decision-making in healthcare.
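
The abstract names its measures without giving formulas, so the following is only an illustrative sketch of how such quantities could be computed, not the authors' implementation. It assumes a binary gender code (0/1, with the assignment arbitrary) and binary class labels; under that assumption the Earth Mover's Distance between the gender distributions of the two classes reduces to the absolute difference in the share of one gender. The bootstrap null (resampling gender values from the pooled data while keeping the labels fixed) is likewise an assumption, and all function names are hypothetical. Equalised Odds is expressed as the between-group gaps in true-positive and false-positive rates, and Treatment Equality as the gap in the ratio of false negatives to false positives.

```python
import random


def class_gender_emd(genders, labels):
    """EMD between the gender distributions of the positive and negative class.

    Assumes genders and labels are coded 0/1; for a two-point distribution
    the EMD equals the absolute difference in the share of gender == 1.
    """
    pos = [g for g, y in zip(genders, labels) if y == 1]
    neg = [g for g, y in zip(genders, labels) if y == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


def bootstrap_p_value(genders, labels, n_boot=2000, seed=0):
    """Assumed null: gender is independent of class.

    Gender values are resampled with replacement from the pooled data while
    the labels stay fixed; the p-value is the share of resamples whose EMD
    is at least as large as the observed one (with a +1 correction).
    """
    rng = random.Random(seed)
    observed = class_gender_emd(genders, labels)
    hits = 0
    for _ in range(n_boot):
        resampled = rng.choices(genders, k=len(genders))
        if class_gender_emd(resampled, labels) >= observed:
            hits += 1
    return (hits + 1) / (n_boot + 1)


def confusion_by_group(y_true, y_pred, genders, group):
    """Confusion-matrix counts (tp, fp, tn, fn) restricted to one gender group."""
    tp = fp = tn = fn = 0
    for t, p, g in zip(y_true, y_pred, genders):
        if g != group:
            continue
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def fairness_gaps(y_true, y_pred, genders):
    """Between-group gaps for Equalised Odds (TPR, FPR) and Treatment Equality (FN/FP)."""
    nan = float("nan")
    stats = {}
    for group in (0, 1):
        tp, fp, tn, fn = confusion_by_group(y_true, y_pred, genders, group)
        stats[group] = {
            "tpr": tp / (tp + fn) if tp + fn else nan,
            "fpr": fp / (fp + tn) if fp + tn else nan,
            "fn_fp": fn / fp if fp else nan,  # undefined without false positives
        }
    return {key + "_gap": abs(stats[0][key] - stats[1][key])
            for key in ("tpr", "fpr", "fn_fp")}


if __name__ == "__main__":
    # Tiny made-up example, purely to show the call pattern.
    genders = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    y_true = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 1, 0, 0]
    print(class_gender_emd(genders, y_true))
    print(bootstrap_p_value(genders, y_true))
    print(fairness_gaps(y_true, y_pred, genders))
```

In a setting like the one described, the fairness gaps would be computed on held-out predictions from each cross-validation fold and then compared across folds with a statistical test; the abstract does not specify which test or what threshold marks a gap as a fairness issue, so neither is fixed here.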
