SLEEPYLAND: trust begins with fair evaluation of automatic sleep staging models
Journal:
arXiv
Published Date:
Jun 10, 2025
Abstract
Despite advances in deep learning for automatic sleep staging, clinical
adoption remains limited due to challenges in fair model evaluation,
generalization across diverse datasets, model bias, and variability in human
annotations. We present SLEEPYLAND, an open-source sleep staging evaluation
framework designed to address these barriers. It includes more than 220'000
hours of in-domain (ID) sleep recordings and more than 84'000 hours of out-of-domain
(OOD) sleep recordings, spanning a broad range of ages, sleep-wake disorders,
and hardware setups. We release pre-trained models based on high-performing SoA
architectures and evaluate them under standardized conditions across single-
and multi-channel EEG/EOG configurations. We introduce SOMNUS, an ensemble
combining models across architectures and channel setups via soft voting.
SOMNUS achieves robust performance across twenty-four different datasets, with
macro-F1 scores between 68.7% and 87.2%, outperforming individual models in
94.9% of cases. Notably, SOMNUS surpasses previous SoA methods, even including
cases where compared models were trained ID while SOMNUS treated the same data
as OOD. Using a subset of the BSWR (N=6'633), we quantify model biases linked
to age, gender, AHI, and PLMI, showing that while ensembling improves robustness,
no model architecture consistently minimizes bias in performance and in the
estimation of clinical markers. In evaluations on OOD multi-annotated datasets (DOD-H,
DOD-O), SOMNUS exceeds the best human scorer, i.e., MF1 85.2% vs 80.8% on
DOD-H, and 80.2% vs 75.9% on DOD-O, better reproducing the scorer consensus
than any individual expert (k = 0.89/0.85 and ACS = 0.95/0.94 for healthy/OSA
cohorts). Finally, we introduce ensemble disagreement metrics, based on entropy and
inter-model divergence, that predict regions of scorer disagreement with
ROC AUCs up to 0.828, offering a data-driven proxy for human uncertainty.
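
To make the soft-voting and entropy-based disagreement ideas concrete, the following is a minimal sketch, not the SLEEPYLAND implementation. It assumes an unweighted mean of per-epoch class probabilities across models and uses the Shannon entropy of the averaged distribution as one possible disagreement score; the abstract does not specify the exact weighting or metric definitions. The stage order, function names, and the toy probabilities are all hypothetical.

```python
import math

# Assumed five-stage label order (hypothetical, for illustration only).
STAGES = ["W", "N1", "N2", "N3", "REM"]


def soft_vote(model_probs):
    """Average class probabilities across models, epoch by epoch.

    model_probs[m][e][c] is the probability that model m assigns
    to class c for epoch e. Unweighted averaging is an assumption.
    """
    n_models = len(model_probs)
    return [
        [sum(col) / n_models for col in zip(*epoch_rows)]
        for epoch_rows in zip(*model_probs)
    ]


def entropy(p):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)


# Toy outputs from three hypothetical models over two 30-second epochs.
model_a = [[0.90, 0.05, 0.03, 0.01, 0.01], [0.1, 0.4, 0.3, 0.1, 0.1]]
model_b = [[0.80, 0.10, 0.05, 0.03, 0.02], [0.1, 0.2, 0.5, 0.1, 0.1]]
model_c = [[0.85, 0.05, 0.05, 0.03, 0.02], [0.2, 0.3, 0.3, 0.1, 0.1]]

avg = soft_vote([model_a, model_b, model_c])

# Ensemble prediction: the stage with the highest averaged probability.
predicted = [STAGES[max(range(len(p)), key=p.__getitem__)] for p in avg]

# Higher entropy flags epochs where the ensemble is less certain.
uncertainty = [entropy(p) for p in avg]
```

In this toy case the models agree on the first epoch and split between N1 and N2 on the second, so the second epoch receives the higher entropy score. A score of this kind could then be thresholded or ranked to flag epochs for manual review.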