Building a PubMed knowledge graph.

Journal: Scientific data
Published Date:

Abstract

PubMed is an essential resource for the medical domain, but useful concepts are either difficult to extract or ambiguous, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge graph (PKG) by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting affiliation history and educational background of authors from ORCID, and identifying fine-grained affiliation data from MapAffil. By integrating these credible multi-source data, we created connections among the bio-entities, authors, articles, affiliations, and funding. Data validation revealed that the BioBERT deep learning method of bio-entity extraction outperformed the state-of-the-art models by 0.51% in F1 score, and that author name disambiguation (AND) achieved an F1 score of 98.09%. PKG can trigger broader innovations, not only enabling us to measure scholarly impact, knowledge usage, and knowledge transfer, but also assisting us in profiling authors and organizations based on their connections with bio-entities.
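
The connections described in the abstract can be pictured with a small sketch. The Python below is an illustration only, not the PKG implementation or schema: the record layout, the identifiers (P1, A1, G1, Org X), and the helper names build_graph and author_entity_profile are all hypothetical. It assumes that entity extraction and author disambiguation have already produced clean identifiers, then links articles to authors, bio-entities, and funding, links authors to affiliations, and profiles one author by the bio-entities in their articles.

```python
from collections import defaultdict

# Hypothetical records: every ID, name, and grant below is made up for illustration.
ARTICLES = [
    {"pmid": "P1", "authors": ["A1"], "entities": ["BRCA1", "breast cancer"], "grants": ["G1"]},
    {"pmid": "P2", "authors": ["A1", "A2"], "entities": ["BRCA1"], "grants": []},
]
AFFILIATIONS = {"A1": "Org X", "A2": "Org Y"}


def build_graph(articles, affiliations):
    """Return an undirected adjacency map over typed nodes such as ("author", "A1")."""
    graph = defaultdict(set)

    def link(a, b):
        graph[a].add(b)
        graph[b].add(a)

    for art in articles:
        article = ("article", art["pmid"])
        for author_id in art["authors"]:
            link(article, ("author", author_id))
        for entity in art["entities"]:
            link(article, ("entity", entity))
        for grant in art["grants"]:
            link(article, ("funding", grant))
    for author_id, org in affiliations.items():
        link(("author", author_id), ("affiliation", org))
    return graph


def author_entity_profile(graph, author_id):
    """Count how many of an author's articles mention each bio-entity."""
    counts = defaultdict(int)
    for kind, node_id in graph[("author", author_id)]:
        if kind != "article":
            continue
        for neighbor_kind, neighbor_id in graph[("article", node_id)]:
            if neighbor_kind == "entity":
                counts[neighbor_id] += 1
    return dict(counts)


graph = build_graph(ARTICLES, AFFILIATIONS)
profile = author_entity_profile(graph, "A1")
print(profile)
```

Typed tuples keep the five node kinds in a single adjacency map, so an author profile is a two-hop walk from author to article to entity. A graph at the scale the abstract describes would need a database or graph store rather than in-memory sets.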

Authors

  • Jian Xu
    Department of Cardiology, Lishui Central Hospital and the Fifth Affiliated Hospital of Wenzhou Medical University, Lishui, China.
  • Sunkyu Kim
    Department of Computer Science and Engineering, Korea University, Seoul 02841, South Korea.
  • Min Song
    Library and Information Science, Yonsei University, Seoul, South Korea.
  • Minbyul Jeong
    Department of Computer Science and Engineering, Korea University, Seoul, South Korea.
  • Donghyeon Kim
  • Jaewoo Kang
    Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea.
  • Justin F Rousseau
    Dell Medical School, University of Texas at Austin, Austin, TX, USA.
  • Xin Li
    Veterinary Diagnostic Center, Shanghai Animal Disease Control Center, Shanghai, China.
  • Weijia Xu
    Texas Advanced Computing Center, Austin, TX, USA.
  • Vetle I Torvik
    School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL, USA.
  • Yi Bu
    Department of Information Management, Peking University, Beijing, China.
  • Chongyan Chen
    School of Information, University of Texas at Austin, Austin, TX, USA.
  • Islam Akef Ebeid
    School of Information, University of Texas at Austin, Austin, TX, USA.
  • Daifeng Li
    School of Information Management, Sun Yat-sen University, Guangzhou, China. lidaifeng@mail.sysu.edu.cn.
  • Ying Ding
    Cockrell School of Engineering, The University of Texas at Austin, Austin, USA.