Automated Sleep Stage and Event Detection Algorithms Using Quality-Controlled PSG Annotations

Journal: medRxiv
Published Date:

Abstract

This study aimed to develop machine-learning models for sleep stage classification, arousal detection, and respiratory event detection from polysomnography (PSG), and to evaluate their performance against expert scorers. Overnight PSG recordings were collected from healthy participants and patients with suspected sleep-disordered breathing. Four certified scorers underwent calibration sessions and produced reference annotations for sleep stages, arousals, and respiratory events. In addition, all four scorers annotated a subset of recordings for consensus analyses. Gradient-boosted decision tree models were trained on hand-crafted features from electroencephalography, electrooculography, electromyography, nasal pressure airflow, and oxygen saturation using five-fold cross-validation. Performance was assessed using epoch-wise accuracy, Cohen’s κ, and F1-score; event-based precision, recall, and F1-score based on intersection-over-union matching; and recording-level agreement for total sleep time, arousal index, and apnea-hypopnea index (AHI). Model performance was compared with inter-scorer agreement derived from the consensus dataset. The sleep stage model achieved an accuracy of 0.840, a Cohen’s κ of 0.791, and an F1-score of 0.841, with total sleep time residuals centered near zero and limits of agreement of approximately ±0.5 h. Arousal detection yielded a recall of 0.725, a precision of 0.742, and an F1-score of 0.733, with arousal index limits of agreement of about ±15 events/h. Respiratory event detection achieved a recall of 0.829, a precision of 0.807, and an F1-score of 0.818, with similar agreement for AHI. In consensus analyses, model performance was comparable to or within the range of inter-scorer agreement. Feature-based decision-tree models can approach human-level accuracy across PSG scoring tasks, highlighting the importance of annotation consistency and supporting their use as decision-support tools in clinical practice.
Accurate and consistent scoring of overnight sleep studies remains a major bottleneck in sleep medicine, limiting timely diagnosis and effective treatment. This study demonstrates that carefully designed, interpretable machine learning models trained on rigorously calibrated expert annotations can reach performance comparable to human scorers for sleep stages, arousals, and respiratory events. The findings suggest that the primary barrier to robust automation is not model complexity but the reliability of human labels. Automated scoring based on such models could provide a stable reference for training clinicians, standardizing quality across centers, and enabling large-scale research. Critical next steps include external validation, prospective outcome studies, and integration into routine clinical workflows.
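
The abstract reports event-based precision, recall, and F1-score based on intersection-over-union (IoU) matching, but does not give the matching rule. The following is a minimal sketch of one common way to compute such metrics, not the authors' implementation. The IoU threshold of 0.3, the greedy one-to-one matching, and the names `iou` and `event_scores` are assumptions made for illustration only.

```python
from typing import List, Tuple

Event = Tuple[float, float]  # (start_s, end_s) in seconds; hypothetical representation


def iou(a: Event, b: Event) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def event_scores(
    reference: List[Event],
    predicted: List[Event],
    threshold: float = 0.3,  # assumed value; the abstract does not state one
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using greedy one-to-one IoU matching."""
    # Score every reference/prediction pair, best overlaps first.
    pairs = sorted(
        (
            (iou(r, p), i, j)
            for i, r in enumerate(reference)
            for j, p in enumerate(predicted)
        ),
        reverse=True,
    )
    used_ref, used_pred = set(), set()
    true_pos = 0
    for score, i, j in pairs:
        if score < threshold:
            break  # all remaining pairs overlap too little to count
        if i in used_ref or j in used_pred:
            continue  # each event may be matched at most once
        used_ref.add(i)
        used_pred.add(j)
        true_pos += 1

    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Toy example: two of three predicted events overlap a scored event.
scored = [(0.0, 10.0), (20.0, 30.0), (50.0, 60.0)]
detected = [(1.0, 11.0), (22.0, 28.0), (80.0, 90.0)]
precision, recall, f1 = event_scores(scored, detected)
```

In this toy example, two detected events match scored events and one does not, so precision, recall, and F1-score all equal 2/3. A stricter threshold or an optimal (rather than greedy) assignment would change which pairs count as matches, so reported values depend on those choices.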

Authors

  • Michiru Kaneda; Sho Ogaki; Tomoyuki Nohara; Syuhei Fujita; Naoshi Osako; Tomoko Yagi; Yasuhiro Tomita; Takanori Ogata

Categories