Semi-supervised Clustering Through Representation Learning of Large-scale EHR Data
Journal: arXiv
Published Date: May 27, 2025
Abstract
Electronic Health Records (EHR) offer rich real-world data for personalized
medicine, providing insights into disease progression, treatment responses, and
patient outcomes. However, their sparsity, heterogeneity, and high
dimensionality make them difficult to model, while the lack of standardized
ground truth further complicates predictive modeling. To address these
challenges, we propose SCORE, a semi-supervised representation learning
framework that captures multi-domain disease profiles through patient
embeddings. SCORE employs a Poisson-Adapted Latent factor Mixture (PALM) Model
with pre-trained code embeddings to characterize codified features and extract
meaningful patient phenotypes and embeddings. To handle the computational
challenges of large-scale data, it introduces a hybrid Expectation-Maximization
(EM) and Gaussian Variational Approximation (GVA) algorithm, leveraging limited
labeled data to refine estimates on a vast pool of unlabeled samples. We
theoretically establish the convergence of this hybrid approach, quantify GVA
errors, and derive SCORE's error rate under diverging embedding dimensions. Our
analysis shows that incorporating unlabeled data enhances accuracy and reduces
sensitivity to label scarcity. Extensive simulations confirm SCORE's superior
finite-sample performance over existing methods. Finally, we apply SCORE to
predict disability status for patients with multiple sclerosis (MS) using
partially labeled EHR data, demonstrating that it produces more informative and
predictive patient embeddings for multiple MS-related conditions than existing
approaches.
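
To illustrate the general idea of using a small labeled set to refine estimates on a large unlabeled pool, the following is a minimal sketch of semi-supervised EM for a plain Poisson mixture over code counts. It is not the PALM model or the hybrid EM/GVA algorithm described above: it omits the latent factors, the pre-trained code embeddings, and the Gaussian variational step. The function name `semi_supervised_poisson_em`, the convention that `-1` marks an unlabeled patient, and the assumption that every cluster has at least one labeled patient are all illustrative choices, not details taken from the paper.

```python
import numpy as np

def semi_supervised_poisson_em(X, labels, n_clusters, n_iter=50, eps=1e-8):
    """Semi-supervised EM for a Poisson mixture (illustrative stand-in only).

    X      : (n, p) array of non-negative counts (e.g. EHR code counts).
    labels : (n,) integer array; cluster index for labeled rows, -1 otherwise.
    Assumes each cluster has at least one labeled row, so that the labeled
    data break the symmetry of the uniform initialization.
    """
    n, p = X.shape
    labeled = labels >= 0

    # Responsibilities: one-hot for labeled rows, uniform for unlabeled rows.
    R = np.full((n, n_clusters), 1.0 / n_clusters)
    R[labeled] = np.eye(n_clusters)[labels[labeled]]

    for _ in range(n_iter):
        # M-step: mixing weights and per-cluster Poisson rates from all rows.
        Nk = R.sum(axis=0) + eps
        pi = Nk / Nk.sum()
        rates = (R.T @ X) / Nk[:, None] + eps  # shape (K, p)

        # E-step: posterior cluster probabilities. The log(x!) term is the
        # same for every cluster, so it cancels and is omitted.
        log_post = X @ np.log(rates).T - rates.sum(axis=1) + np.log(pi)
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # Only unlabeled rows are updated; labeled rows stay fixed one-hot.
        R[~labeled] = post[~labeled]

    return pi, rates, R
```

In this sketch the labeled patients act as fixed anchors in every M-step, while the unlabeled patients contribute through their soft assignments, which is one simple way that unlabeled data can sharpen the parameter estimates.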