A natural language processing approach to support biomedical data harmonization: Leveraging large language models.

Journal: PloS one

Published Date: Jul 24, 2025

Abstract

BACKGROUND: Biomedical research requires large, diverse samples to produce unbiased results. Retrospective data harmonization is often used to integrate existing datasets to create these samples, but the process is labor-intensive. Automated methods for matching variables across datasets can accelerate this process, particularly when harmonizing datasets with numerous variables and varied naming conventions. Research in this area has been limited, primarily focusing on lexical matching and ontology-based semantic matching. We aimed to develop new methods, leveraging large language models (LLMs) and ensemble learning, to automate variable matching.
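As a rough illustration of the lexical matching baseline mentioned above, the following minimal sketch scores variable names by character-level similarity using Python's standard-library difflib. It is not the authors' method; the helper names (normalize, lexical_similarity, best_match) and the example variable names are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a variable name and turn underscores/hyphens into single spaces."""
    return " ".join(name.lower().replace("_", " ").replace("-", " ").split())


def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(source: str, candidates: list[str]) -> tuple[str, float]:
    """Return the candidate with the highest lexical similarity to source."""
    scored = [(c, lexical_similarity(source, c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical variable names, not taken from any real dataset.
target_vars = ["systolic_blood_pressure", "body_mass_index", "age_at_exam"]
match, score = best_match("Systolic-Blood-Pressure", target_vars)
```

Purely lexical scoring of this kind only sees surface spelling, so it would likely miss pairs that share a meaning but not a spelling (for example, an abbreviation versus a full name), which is the kind of gap that semantic approaches aim to address.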

Authors

Zexu Li

Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, Northeastern University, Shenyang, 110819, China; National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang, 110819, China; Key Laboratory of Data Analytics and Optimization for Smart Industry, Northeastern University, Ministry of Education, Shenyang, 110819, China.

Suraj P Prabhu

Department of Bioinformatics, Boston University Faculty of Computing & Data Sciences, Boston, Massachusetts, United States of America.

Zachary T Popp

Department of Anatomy and Neurobiology, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America.

Shubhi S Jain

Slone Epidemiology Center, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America.

Vijetha Balakundi

Department of Medicine/Section of Preventive Medicine and Epidemiology, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America.

Ting Fang Alvin Ang

The Framingham Heart Study, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.

Rhoda Au

Boston University School of Medicine, rhodaau@bu.edu.

Jinying Chen

Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, MA, United States. Electronic address: jinying.chen@umassmed.edu.

Keywords

Alzheimer Disease; Biomedical Research; Humans; Japan; Language; Large Language Models; Natural Language Processing; Semantics

External Resources

PubMed ID: 40705832
