Semi-Automated Data Curation from Biomedical Literature.

Journal: AMIA ... Annual Symposium proceedings. AMIA Symposium
Published Date:

Abstract

Data curation is a bottleneck for many informatics pipelines. One example is aggregating data from preclinical studies to identify novel genetic pathways for atherosclerosis in humans. This requires extracting data from published mouse studies, such as the perturbed gene and its impact on lesion sizes and plaque inflammation, which is non-trivial. Curation efforts are resource-heavy, with curators manually extracting data from hundreds of publications. In this work, we describe the development of a semi-automated curation tool to accelerate data extraction. We use natural language processing (NLP) methods to auto-populate a web-based form, which is then reviewed by a curator. We conducted a controlled user study to evaluate the curation tool. Our NLP model achieves 70% accuracy on categorical fields, and our curation tool accelerates task completion by 49% compared to manual curation.

Authors

  • Protiva Rahman
    Vanderbilt University Medical Center, Nashville, TN.
  • Daniel Fabbri
    Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, United States of America.