Learning lifetime disease liability reveals and removes genetic confounding in electronic health records

Journal: medRxiv
Published Date:

Abstract

Electronic health records (EHRs) have become the cornerstone of population-scale genetic studies [1], but factors such as patterns of healthcare use shape which diagnoses are recorded and how, confounding genetic associations with EHR codes [2]. In this study we propose EDGAR, a deep learning framework that recovers lifetime disease liability from EHRs by aligning diagnostic codes with clinically validated measures and disease labels in a set of individuals prioritized through active learning. EDGAR yields representations that better capture disease-specific effects in genome-wide association analyses (GWAS). It also enables us to isolate a genetic factor that captures systemic biases in EHR codes; this factor distorts cross-disease correlations and drives spurious links with behavioral and socio-economic traits. We find that this factor generalizes across EHRs, and that identifying it in one EHR enables its removal from existing GWAS in another. Overall, our work presents a promising direction for improving the specificity of EHR-based GWAS.

Authors

  • Di, Y.
  • Cai, N.