Beyond accuracy: a framework for evaluating algorithmic bias and performance, applied to automated sleep scoring.

Journal: Scientific Reports

Published Date: Jul 1, 2025

Abstract

Recent advancements in artificial intelligence (AI) have significantly improved sleep-scoring algorithms, bringing their performance close to the theoretical limit of approximately 80%, which aligns with inter-scorer agreement levels. While this suggests the problem is technically solved, clinical adoption remains challenging due to ethical and regulatory requirements for rigorous validation, fairness, and human oversight. Existing validation methods, such as Bland-Altman analysis, often rely on simple correlation metrics, overlooking potential non-linear influences of external factors (e.g., demographic or clinical variables) on systematic predictive errors (biases) in derived clinical markers. Additionally, performance metrics are typically reported as the mean of per-subject results, neglecting critical scenarios, such as different quantiles, that could better convey the algorithm's capabilities and limitations to clinicians as end-users. To address this gap, we propose a universal framework for quantifying both performance metrics and biases in predictive algorithmic tools. Our approach extends conventional validation methods by analyzing how external factors shape the entire distribution of predictive performance and errors, rather than just the expected mean. Applying it to the widely recognized U-Sleep and YASA sleep-scoring algorithms, we identify biases, such as age-related shifts, that indicate missing input information or imbalances in training data. Despite these biases, we illustrate that both algorithms maintain non-inferior performance in the risk assessment of sleep apnea based on prediction-derived markers, highlighting the potential and clinical utility of algorithmic insights.
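
The contrast between a mean-only summary and a distribution-level view can be illustrated with a minimal sketch. It is not the authors' framework: it assumes a simple age-binning stand-in for the covariate-dependent modelling the abstract describes, and the function names (`bland_altman`, `quantile`, `bias_quantiles_by_age`), the bin edge, and the toy data are all hypothetical.

```python
from statistics import mean, stdev


def bland_altman(differences):
    """Mean bias and 95% limits of agreement (bias +/- 1.96 * SD)."""
    bias = mean(differences)
    half_width = 1.96 * stdev(differences)
    return bias, bias - half_width, bias + half_width


def quantile(values, p):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] * (1 - fraction) + ordered[upper] * fraction


def bias_quantiles_by_age(records, age_edges, probs=(0.1, 0.5, 0.9)):
    """Bin (age, difference) pairs by age and return quantiles of the difference per bin."""
    groups = {}
    for age, difference in records:
        # The first edge the age falls below names the bin; otherwise use the open-ended top bin.
        label = next((f"<{edge}" for edge in age_edges if age < edge), f">={age_edges[-1]}")
        groups.setdefault(label, []).append(difference)
    return {label: tuple(quantile(diffs, p) for p in probs) for label, diffs in groups.items()}


# Hypothetical toy data: (age in years, algorithm minus human scorer,
# e.g. minutes of total sleep time). Not taken from the paper.
RECORDS = [(25, 2.0), (30, -1.0), (35, 4.0), (55, 8.0), (60, 12.0), (70, 10.0)]

if __name__ == "__main__":
    print(bland_altman([d for _, d in RECORDS]))
    print(bias_quantiles_by_age(RECORDS, age_edges=[40]))
```

In this toy example the pooled Bland-Altman bias hides the fact that the error distribution sits higher in the older bin than in the younger one, which is the kind of age-related shift a mean-only report would miss.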

Authors

Michal Bechny

Department of Innovative Technologies, Institute of Digital Technologies for Personalized Healthcare (MeDiTech), University of Applied Sciences and Arts of Southern Switzerland, Lugano, Switzerland.

Luigi Fiorillo

Department of Innovative Technologies, Institute of Digital Technologies for Personalized Healthcare (MeDiTech), University of Applied Sciences and Arts of Southern Switzerland, Lugano, Switzerland.

Julia van der Meer

Department of Neurology, Inselspital, Bern University Hospital and University of Bern, Bern, 3010, Switzerland.

Markus Schmidt

Biofaction KG, 1030 Vienna, Austria.

Claudio Bassetti

Department of Neurology, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland.

Athina Tzovara

Institute of Computer Science, University of Bern, Bern, Switzerland.

Francesca Faraci

Institute of Digital Technologies for Personalized Healthcare (MeDiTech), University of Applied Sciences and Arts of Southern Switzerland (SUPSI), Lugano, 6962, Switzerland.

Keywords

Adult; Algorithms; Artificial Intelligence; Bias; Female; Humans; Male; Middle Aged; Polysomnography; Reproducibility of Results; Sleep

External Resources

PubMed ID: 40595988
