A Weakly Supervised Transformer to Support Rare Disease Diagnosis from Electronic Health Records: Methods and Applications in Rare Pulmonary Disease
Journal: arXiv
Published Date: Jul 1, 2025
Abstract
Rare diseases affect an estimated 300-400 million people worldwide, yet
individual conditions often remain poorly characterized and difficult to
diagnose due to their low prevalence and limited clinician familiarity. While
computational phenotyping algorithms show promise for automating rare disease
detection, their development is hindered by the scarcity of labeled data and
biases in existing label sources. Gold-standard labels from registries and
expert chart reviews are highly accurate but constrained by selection bias and
the cost of manual review. In contrast, labels derived from electronic health
records (EHRs) cover a broader range of patients but can introduce substantial
noise. To address these challenges, we propose a weakly supervised,
transformer-based framework that combines a small set of gold-standard labels
with a large volume of iteratively updated silver-standard labels derived from
EHR data. This hybrid approach enables the training of a highly accurate and
generalizable phenotyping model that scales rare disease detection beyond the
scope of individual clinical expertise. Our method is initialized by learning
embeddings of medical concepts based on their semantic meaning or co-occurrence
patterns in EHRs, which are then refined and aggregated into patient-level
representations via a multi-layer transformer architecture. Using two rare
pulmonary diseases as a case study, we validate our model on EHR data from
Boston Children's Hospital. Compared with baseline methods, our framework
improves phenotype classification, identification of clinically meaningful
subphenotypes through patient clustering, and prediction of disease
progression. These results highlight the potential of our approach to
enable scalable identification and stratification of rare disease patients for
clinical care and research applications.
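
The interplay between gold- and silver-standard labels described above can be illustrated with a minimal sketch. It is not the authors' implementation: a one-feature weighted logistic regression stands in for the transformer, and the update rule (replace each silver label with the current model's prediction, keep gold labels fixed and up-weighted) is an assumed self-training scheme. The names `fit_logistic`, `refine_silver_labels`, `gold_weight`, and `rounds` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, targets, weights, steps=500, lr=0.5):
    """Weighted logistic regression on 1-D inputs with soft targets in [0, 1].

    Stand-in for the transformer classifier (assumption, for illustration).
    """
    w, b = 0.0, 0.0
    total = sum(weights)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y, s in zip(xs, targets, weights):
            err = sigmoid(w * x + b) - y  # gradient of weighted cross-entropy
            grad_w += s * err * x
            grad_b += s * err
        w -= lr * grad_w / total
        b -= lr * grad_b / total
    return w, b


def refine_silver_labels(xs, gold, silver, rounds=3, gold_weight=5.0):
    """Iteratively update noisy silver labels while keeping gold labels fixed.

    xs:     one numeric feature per patient (toy patient representation)
    gold:   dict {patient_index: 0 or 1} of trusted labels
    silver: initial noisy soft labels in [0, 1], one per patient
    """
    labels = list(silver)
    for i, y in gold.items():
        labels[i] = float(y)  # gold labels override silver ones
    # Assumed weighting: gold patients count more than silver patients.
    weights = [gold_weight if i in gold else 1.0 for i in range(len(xs))]
    for _ in range(rounds):
        w, b = fit_logistic(xs, labels, weights)
        for i, x in enumerate(xs):
            if i not in gold:
                labels[i] = sigmoid(w * x + b)  # refreshed silver label
    return labels


# Toy cohort: patient 2 (x = -1.0) starts with a misleading silver label of 0.8.
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
gold = {0: 0, 5: 1}
silver = [0.1, 0.2, 0.8, 0.9, 0.7, 0.9]
refined = refine_silver_labels(xs, gold, silver)
print([round(p, 3) for p in refined])
```

In this toy setting the refreshed silver labels become monotone in the feature, so the initially mislabeled patient ends up scored below the patients with larger feature values, while the two gold labels are never overwritten.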