Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data
Journal:
arXiv
Published Date:
Feb 12, 2025
Abstract
The adoption of EHRs has expanded opportunities to leverage data-driven
algorithms in clinical care and research. A major bottleneck in effectively
conducting multi-institutional EHR studies is the data heterogeneity across
systems with numerous codes that either do not exist or represent different
clinical concepts across institutions. The need for data privacy further limits
the feasibility of including multi-institutional patient-level data required to
study similarities and differences across patient subgroups. To address these
challenges, we developed the GAME algorithm. Tested and validated across 7
institutions and 2 languages, GAME integrates data in several levels: (1) at
the institutional level with knowledge graphs to establish relationships
between codes and existing knowledge sources, providing the medical context for
standard codes and their relationship to each other; (2) between institutions,
leveraging language models to determine the relationships between
institution-specific codes with established standard codes; and (3) quantifying
the strength of the relationships between codes using a graph attention
network. Jointly trained embeddings are created using transfer and federated
learning to preserve data privacy. In this study, we demonstrate the
applicability of GAME in selecting relevant features as inputs for AI-driven
algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis.
We then highlight the application of GAME harmonized multi-institutional EHR
data in a study of Alzheimer's disease outcomes and suicide risk among patients
with mental health disorders, without sharing patient-level data outside
individual institutions.