Robust Automated Harmonization of Heterogeneous Data Through Ensemble Machine Learning: Algorithm Development and Validation Study.

Journal: JMIR medical informatics
Published Date:

Abstract

BACKGROUND: Cohort studies contain rich clinical data across large and diverse patient populations and are a common source of observational data for clinical research. Because large-scale cohort studies are both time- and resource-intensive, one alternative is to harmonize data from existing cohorts through multicohort studies. However, because variables are encoded differently across cohorts, accurate variable harmonization is difficult.

Authors

  • Doris Yang
    Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States.
  • Doudou Zhou
    Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.
  • Steven Cai
    Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, United States.
  • Ziming Gan
    Department of Statistics, University of Chicago, Chicago, IL, United States.
  • Michael Pencina
    Duke Clinical Research Institute, Durham, NC, USA.
  • Paul Avillach
    Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA.
  • Tianxi Cai
    Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States.
  • Chuan Hong
    Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.