A natural language processing approach to support biomedical data harmonization: Leveraging large language models.

Journal: PLOS ONE
Published Date:

Abstract

BACKGROUND: Biomedical research requires large, diverse samples to produce unbiased results. Retrospective data harmonization is often used to integrate existing datasets into such samples, but the process is labor-intensive. Automated methods for matching variables across datasets can accelerate it, particularly when the datasets contain numerous variables and follow varied naming conventions. Prior research in this area is limited and has focused primarily on lexical matching and ontology-based semantic matching. We aimed to develop new methods that leverage large language models (LLMs) and ensemble learning to automate variable matching.

Authors

  • Zexu Li
    Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, Northeastern University, Shenyang, 110819, China; National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang, 110819, China; Key Laboratory of Data Analytics and Optimization for Smart Industry, Northeastern University, Ministry of Education, Shenyang, 110819, China.
  • Suraj P Prabhu
    Department of Bioinformatics, Boston University Faculty of Computing & Data Sciences, Boston, Massachusetts, United States of America.
  • Zachary T Popp
    Department of Anatomy and Neurobiology, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America.
  • Shubhi S Jain
    Slone Epidemiology Center, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America.
  • Vijetha Balakundi
    Department of Medicine/Section of Preventive Medicine and Epidemiology, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America.
  • Ting Fang Alvin Ang
    The Framingham Heart Study, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
  • Rhoda Au
    Boston University School of Medicine, rhodaau@bu.edu.
  • Jinying Chen
    Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, MA, United States. Electronic address: jinying.chen@umassmed.edu.