Unsupervised machine learning for the discovery of latent disease clusters and patient subgroups using electronic health records.

Journal: Journal of Biomedical Informatics
Published Date:

Abstract

Machine learning has become a ubiquitous, key technology for mining electronic health records (EHRs) to facilitate clinical research and practice. Unsupervised machine learning, as opposed to supervised learning, has shown promise in identifying novel patterns and relations in EHRs without using human-created labels. In this paper, we investigate the application of unsupervised machine learning models to discovering latent disease clusters and patient subgroups from EHRs. We utilized Latent Dirichlet Allocation (LDA), a generative probabilistic model, and proposed a novel model named the Poisson Dirichlet Model (PDM), which extends the LDA approach by using a Poisson distribution to model patients' disease diagnoses and alleviates the effects of age and sex by considering both observed and expected observations. In the empirical experiments, we evaluated LDA and PDM on three patient cohorts, namely the Osteoporosis, Delirium/Dementia, and Chronic Obstructive Pulmonary Disease (COPD)/Bronchiectasis cohorts, with EHR data retrieved from the Rochester Epidemiology Project (REP) medical records linkage system. We compared the effectiveness of LDA and PDM in identifying disease clusters by visualizing the disease representations. We tested the performance of LDA and PDM in differentiating patient subgroups through survival analysis, as well as statistical analysis of demographics and Elixhauser Comorbidity Index (ECI) scores in those subgroups. The experimental results show that the proposed PDM could effectively identify distinct disease clusters based on the latent patterns hidden in the EHR data by alleviating the impact of age and sex, and that LDA could stratify patients into more differentiable subgroups than PDM, as indicated by smaller p-values. However, the subgroups identified by LDA are highly associated with patients' age and sex. The subgroups discovered by PDM, because the effects of age and sex are alleviated, might reflect underlying disease patterns of greater interest in epidemiology research. Both unsupervised machine learning approaches could be leveraged to discover patient subgroups using EHRs, but with different foci.
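
The abstract does not say how the "expected" observations are derived, so the following is only a minimal sketch of one common way to compare observed and expected diagnosis counts: indirect standardization over age and sex strata. It is not the PDM itself. The toy records, the age bands, and the function names `stratum_rates` and `observed_expected` are all hypothetical and serve illustration only.

```python
from collections import defaultdict

# Hypothetical toy records: (patient_id, age_band, sex, set of diagnoses).
patients = [
    ("p1", "60-69", "F", {"osteoporosis", "hypertension"}),
    ("p2", "60-69", "F", {"hypertension"}),
    ("p3", "60-69", "M", {"copd"}),
    ("p4", "70-79", "M", {"copd", "hypertension"}),
    ("p5", "70-79", "M", {"hypertension"}),
    ("p6", "70-79", "F", {"osteoporosis"}),
]


def stratum_rates(records):
    """Prevalence of each diagnosis within each (age_band, sex) stratum."""
    sizes = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for _, age, sex, diagnoses in records:
        sizes[(age, sex)] += 1
        for code in diagnoses:
            counts[(age, sex)][code] += 1
    return {
        stratum: {code: n / sizes[stratum] for code, n in codes.items()}
        for stratum, codes in counts.items()
    }


def observed_expected(records, group_ids):
    """Observed and expected diagnosis counts for one patient subgroup.

    The expected count of a diagnosis is the sum, over subgroup members,
    of that diagnosis's prevalence in the member's own age/sex stratum
    (an assumption of this sketch, not a detail taken from the abstract).
    """
    rates = stratum_rates(records)
    observed = defaultdict(int)
    expected = defaultdict(float)
    for pid, age, sex, diagnoses in records:
        if pid not in group_ids:
            continue
        for code in diagnoses:
            observed[code] += 1
        for code, rate in rates.get((age, sex), {}).items():
            expected[code] += rate
    return dict(observed), dict(expected)


observed, expected = observed_expected(patients, {"p1", "p4"})
# A ratio above 1 means the diagnosis is more frequent in the subgroup
# than age and sex alone would predict in this toy data.
ratios = {code: observed.get(code, 0) / expected[code] for code in expected}
```

In this toy data, hypertension occurs in the subgroup exactly as often as the members' age and sex strata would predict, while osteoporosis and COPD occur more often than expected. A signal of that kind is independent of age and sex, which is the sort of pattern the abstract attributes to PDM.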

Authors

  • Yanshan Wang
    Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.
  • Yiqing Zhao
    Department of Health Informatics and Administration, Center for Biomedical Data and Language Processing, University of Wisconsin-Milwaukee, 2025 E Newport Ave, NWQ-B Room 6469, Milwaukee, WI, 53211, USA.
  • Terry M Therneau
    Department of Health Science Research, Mayo Clinic, MN, USA.
  • Elizabeth J Atkinson
    Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN.
  • Ahmad P Tafti
    School of Health and Rehabilitation Sciences, University of Pittsburgh, Pittsburgh, PA 15260, USA.
  • Nan Zhang
    Department of Pulmonary and Critical Care Medicine II, Emergency General Hospital, Beijing, China.
  • Shreyasee Amin
    Division of Rheumatology, Department of Medicine, Mayo Clinic, 200 1st ST SW, Rochester, MN, 55905, USA.
  • Andrew H Limper
    Division of Pulmonary and Critical Care Medicine, Department of Internal Medicine, Mayo Clinic, Rochester, MN, USA.
  • Sundeep Khosla
    Division of Endocrinology and Kogod Center on Aging, Department of Internal Medicine, Mayo Clinic, Rochester, MN, USA.
  • Hongfang Liu
    Department of Artificial Intelligence & Informatics, Mayo Clinic, Rochester, MN, United States.