Beyond accuracy: a framework for evaluating algorithmic bias and performance, applied to automated sleep scoring.

Journal: Scientific Reports
Published Date:

Abstract

Recent advancements in artificial intelligence (AI) have significantly improved sleep-scoring algorithms, bringing their performance close to the theoretical limit of approximately 80%, which aligns with inter-scorer agreement levels. While this suggests the problem is technically solved, clinical adoption remains challenging due to ethical and regulatory requirements for rigorous validation, fairness, and human oversight. Existing validation methods, such as Bland-Altman analysis, often rely on simple correlation metrics, overlooking potential non-linear influences of external factors (e.g., demographic or clinical variables) on systematic predictive errors (biases) in derived clinical markers. Additionally, performance metrics are typically reported as the mean of on-subject results, neglecting critical scenarios (such as different quantiles) that could better convey the algorithm's capabilities and limitations to clinicians as end-users. To address this gap, we propose a universal framework for quantifying both performance metrics and biases in predictive algorithmic tools. Our approach extends conventional validation methods by analyzing how external factors shape the entire distribution of predictive performance and errors, rather than just the expected mean. Applying it to the widely recognized U-Sleep and YASA sleep-scoring algorithms, we identify biases, such as age-related shifts, indicating missing input information or imbalances in training data. Despite these biases, we illustrate that both algorithms maintain non-inferior performance in the risk assessment of sleep apnea based on prediction-derived markers, highlighting the potential and clinical utility of algorithmic insights.
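
The contrast the abstract draws, between reporting a single mean and describing the whole distribution of performance and errors, can be illustrated with a minimal sketch. This is not the authors' framework. It is a simplified stand-in that assumes per-subject scores and per-subject marker errors are already available as arrays, and it uses plain binning of an external factor (for example age) instead of any distributional regression model. All function names, bin edges, and quantile levels below are hypothetical.

```python
import numpy as np


def summarize_performance(per_subject_scores, quantiles=(0.05, 0.5, 0.95)):
    """Report the mean alongside selected quantiles of per-subject scores."""
    scores = np.asarray(per_subject_scores, dtype=float)
    summary = {"mean": float(scores.mean())}
    for q in quantiles:
        # Keys look like "q05", "q50", "q95" (hypothetical naming).
        summary[f"q{int(round(q * 100)):02d}"] = float(np.quantile(scores, q))
    return summary


def bland_altman(reference, predicted):
    """Classical Bland-Altman bias and 95% limits of agreement."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(reference, dtype=float)
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)


def error_quantiles_by_factor(errors, factor, bin_edges, quantiles=(0.05, 0.5, 0.95)):
    """Quantiles of the prediction error within bins of an external factor.

    Binning is an assumption made for illustration only; it is a crude way
    to see whether the error distribution shifts with the factor.
    """
    errors = np.asarray(errors, dtype=float)
    factor = np.asarray(factor, dtype=float)
    bin_index = np.digitize(factor, bin_edges)
    result = {}
    for i in range(1, len(bin_edges)):
        mask = bin_index == i
        if not mask.any():
            continue  # skip empty bins instead of returning NaN
        key = (bin_edges[i - 1], bin_edges[i])
        result[key] = [float(np.quantile(errors[mask], q)) for q in quantiles]
    return result
```

A call such as error_quantiles_by_factor(errors, ages, [0, 50, 100]) returns one list of error quantiles per age bin, so a shift in the median or a widening of the tails between bins becomes visible where a single pooled bias would hide it.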

Authors

  • Michal Bechny
    Department of Innovative Technologies, Institute of Digital Technologies for Personalized Healthcare (MeDiTech), University of Applied Sciences and Arts of Southern Switzerland, Lugano, Switzerland.
  • Luigi Fiorillo
    Department of Innovative Technologies, Institute of Digital Technologies for Personalized Healthcare (MeDiTech), University of Applied Sciences and Arts of Southern Switzerland, Lugano, Switzerland.
  • Julia van der Meer
    Department of Neurology, Inselspital, Bern University Hospital and University of Bern, Bern, 3010, Switzerland.
  • Markus Schmidt
    Biofaction KG, 1030 Vienna, Austria.
  • Claudio Bassetti
    Department of Neurology, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland.
  • Athina Tzovara
    Institute of Computer Science, University of Bern, Bern, Switzerland.
  • Francesca Faraci
    Institute of Digital Technologies for Personalized Healthcare (MeDiTech), University of Applied Sciences and Arts of Southern Switzerland (SUPSI), Lugano, 6962, Switzerland.