The effect of human ratings reliability on machine learning model performance: a case study in infant pain assessment.

Journal: Scientific Reports
Published Date:

Abstract

Human-annotated data is foundational for supervised machine learning (ML), but low inter-rater reliability often introduces noise that degrades model performance. This study investigates how human rating reliability and panel size affect ML performance, and introduces a novel debiasing procedure that uses Random Effects Models (REMs) to mitigate annotator noise. We conducted two complementary experiments. Experiment 1 analyzed real-world assessments from nine evaluators who rated 355 infant images. Panel size and specific psychometric reliability indices (namely Cronbach's α and the Generalizability and Dependability coefficients) were strong predictors of ML performance across four algorithms, whereas intraclass correlation coefficients proved less robust. Experiment 2 generated simulated datasets mimicking Experiment 1, incorporating 40 virtual raters with varying expertise levels and structured pattern noise, to evaluate the robustness of rating aggregation across diverse rating scenarios. These simulations confirmed that annotator noise significantly impairs classification, and that the proposed REM-based debiasing procedure effectively recovers ground-truth scores. Notably, ML models trained on REM-debiased data from only two raters achieved predictive performance comparable to models trained on mean-aggregated scores from eight raters. Overall, the study underscores the importance of psychometrically sound data curation and demonstrates that debiasing techniques can substantially improve ML accuracy and efficiency even with small expert panels.
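To make two of the quantities named in the abstract concrete, the following is a minimal sketch, not the authors' procedure. It computes Cronbach's α for an images × raters matrix and removes each rater's additive offset, a much simpler stand-in for the REM-based debiasing described above. The function names, the simulated data, and the noise parameters are all illustrative assumptions; only the 355 × 9 shape echoes Experiment 1.

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for a (n_images, n_raters) matrix, raters as 'items'."""
    k = ratings.shape[1]
    rater_vars = ratings.var(axis=0, ddof=1)      # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of per-image summed scores
    return k / (k - 1) * (1.0 - rater_vars.sum() / total_var)

def remove_rater_offsets(ratings: np.ndarray) -> np.ndarray:
    """Hypothetical helper: subtract each rater's mean deviation from the grand mean.

    This handles additive rater bias only; it is NOT the paper's REM procedure.
    """
    offsets = ratings.mean(axis=0) - ratings.mean()
    return ratings - offsets

# Simulated data (assumed parameters): true score + rater bias + random noise.
rng = np.random.default_rng(0)
true_scores = rng.normal(5.0, 2.0, size=355)
rater_bias = rng.normal(0.0, 1.0, size=9)
noise = rng.normal(0.0, 1.0, size=(355, 9))
ratings = true_scores[:, None] + rater_bias[None, :] + noise

alpha = cronbach_alpha(ratings)
debiased = remove_rater_offsets(ratings)
aggregated = debiased.mean(axis=1)  # one aggregated label per image
```

In this sketch, removing additive offsets leaves Cronbach's α unchanged, because α is a consistency index that depends only on variances and covariances. Indices that count rater main effects as error, such as the Dependability coefficient, would be sensitive to such offsets.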

Authors
