Early Detection of Rare Disease Using Hierarchical Set-to-Sequence Modeling of Structured Electronic Health Records
Journal: medRxiv
Published Date: May 6, 2026
Abstract
Rare diseases are characterized by heterogeneous, weak, and sparse phenotypic signals that emerge gradually across longitudinal clinical visits, making early detection a persistent challenge. In this study, we propose a hierarchical set-to-sequence (HSS) framework for prospective rare disease detection using structured EHR data. HSS decomposes the problem into two levels: (1) intra-visit encoding via Multi-Query Attention (MQA), which treats the heterogeneous clinical events within a single visit as an unordered set and produces a unified visit-level representation, and (2) inter-visit temporal modeling with transformer encoders conditioned on patient age at each visit and on inter-visit time gaps, which captures disease progression and the irregular spacing of clinical visits. We construct a real-world cohort of 40,223 patients comprising 708,422 visits from a single academic medical center (2005–2025), with 3,032 rare disease cases, including severe neurodevelopmental, congenital, or genetic conditions, identified by curated rule-based phenotyping. We formulate the task as multi-horizon prospective binary classification with five prediction horizons of 7, 30, 90, 180, and 365 days prior to first diagnosis. Experimental results show that the proposed HSS model consistently outperforms logistic regression, XGBoost, and Transformer-based baselines at every prediction horizon, with performance ranging from AUROC = 0.893 and AUPRC = 0.601 at 7 days (5.17% prevalence) to AUROC = 0.829 and AUPRC = 0.228 at 365 days (3.98% prevalence). Notably, the performance gap between HSS and the strongest competing baseline is largest at the 365-day horizon, indicating a greater advantage for long-horizon prediction, where phenotypic signals for rare diseases are weak and sparse. Additional analyses clarify the contribution of each hierarchical component and confirm the importance of hierarchical modeling.
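To make the intra-visit step concrete, the following is a minimal sketch of attention pooling over an unordered set of event embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function `pool_visit`, the fixed `QUERIES` vectors (which would be learned in practice), the embedding dimension, and the omission of key/value projections are all hypothetical simplifications. The point it shows is that pooling with query vectors yields the same visit-level vector regardless of the order in which events are listed.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pool_visit(event_embeddings, queries):
    """Pool an unordered set of event embeddings into one visit vector.

    Each query attends over all events (scaled dot-product attention) and
    produces one pooled vector; the pooled vectors are concatenated.
    Assumption: keys and values are the raw embeddings (no projections).
    """
    dim = len(event_embeddings[0])
    scale = math.sqrt(dim)
    visit_vector = []
    for q in queries:
        weights = softmax([dot(q, e) / scale for e in event_embeddings])
        pooled = [
            sum(w * e[i] for w, e in zip(weights, event_embeddings))
            for i in range(dim)
        ]
        visit_vector.extend(pooled)
    return visit_vector

# Hypothetical query vectors; in a trained model these would be learned.
QUERIES = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]

# Hypothetical embeddings for three events (e.g. a diagnosis code, a lab
# result, a medication) recorded in the same visit.
visit = [[0.2, 0.1, 0.0], [0.9, -0.3, 0.4], [-0.5, 0.7, 0.1]]

visit_vector = pool_visit(visit, QUERIES)
shuffled_vector = pool_visit(list(reversed(visit)), QUERIES)
```

Because the attention weights depend only on each event's content and the weighted sum is commutative, `visit_vector` and `shuffled_vector` agree up to floating-point rounding. A sequence of such visit vectors, together with age and time-gap features, would then be passed to the inter-visit temporal encoder.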
This work contributes to the ongoing development of AI methodologies tailored to rare diseases by introducing a hierarchical framework for early detection using structured longitudinal clinical data.
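The multi-horizon prospective formulation described in the abstract can be illustrated with a short sketch. It assumes, as one plausible reading, that for a horizon of h days the model may only see visits dated at least h days before the first diagnosis; the abstract does not spell out the exact censoring rule, so the functions `visits_before_horizon` and `inter_visit_gaps` and the example dates are hypothetical.

```python
from datetime import date, timedelta

# Prediction horizons stated in the abstract, in days before first diagnosis.
HORIZONS_DAYS = (7, 30, 90, 180, 365)

def visits_before_horizon(visit_dates, index_date, horizon_days):
    """Return visits dated on or before (index_date - horizon_days).

    Assumption: index_date is the first diagnosis date for a case, or an
    analogous reference date for a control.
    """
    cutoff = index_date - timedelta(days=horizon_days)
    return [d for d in sorted(visit_dates) if d <= cutoff]

def inter_visit_gaps(visit_dates):
    """Days elapsed since the previous visit (0 for the first visit)."""
    ordered = sorted(visit_dates)
    if not ordered:
        return []
    return [0] + [(b - a).days for a, b in zip(ordered, ordered[1:])]

# Hypothetical patient with three visits and a first diagnosis on 2021-01-01.
visits = [date(2020, 1, 1), date(2020, 6, 1), date(2020, 12, 20)]
first_diagnosis = date(2021, 1, 1)

visible_counts = {
    h: len(visits_before_horizon(visits, first_diagnosis, h))
    for h in HORIZONS_DAYS
}
gaps = inter_visit_gaps(visits)
```

Under this reading, longer horizons leave fewer visits visible to the model (here three visits at 7 days but only one at 365 days), which is consistent with the abstract's observation that long-horizon prediction must work from weaker and sparser signals. The gaps computed by `inter_visit_gaps` are one example of the irregular-interval information an inter-visit encoder could be conditioned on.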