A Combined Manual Annotation and Deep-Learning Natural Language Processing Study on Accurate Entity Extraction in Hereditary Disease Related Biomedical Literature.

Journal: Interdisciplinary sciences, computational life sciences
PMID:

Abstract

We report a combined manual annotation and deep-learning natural language processing study aimed at accurate entity extraction from biomedical literature on hereditary diseases. A total of 400 full articles were manually annotated, following published guidelines, by experienced genetic interpreters at Beijing Genomics Institute (BGI). The quality of our manual annotations was assessed by comparing our re-annotated results with publicly available annotations. The overall Jaccard index was 0.866 across the four entity types: gene, variant, disease and species. Both a BERT-based large named entity recognition (NER) model and a DistilBERT-based simplified NER model were trained, validated and tested. Because the manually annotated corpus was limited, these NER models were fine-tuned in two phases. The F1-scores of the BERT-based NER model for gene, variant, disease and species are 97.28%, 93.52%, 92.54% and 95.76%, respectively, while those of the DistilBERT-based NER model are 95.14%, 86.26%, 91.37% and 89.92%, respectively. Most importantly, variant entities were extracted by a large language model for the first time, with an F1-score comparable to that of the state-of-the-art variant extraction model tmVar.
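The two metrics reported in the abstract, the Jaccard index for annotation agreement and the F1-score for NER performance, can be illustrated with a minimal sketch. The abstract does not say how annotations were matched, so the code assumes exact matching on span boundaries and entity type; the `Span` representation, the function names and the toy annotations are hypothetical and are not taken from the study.

```python
from typing import Set, Tuple

# Hypothetical span representation: (document id, start offset, end offset, entity type).
Span = Tuple[str, int, int, str]


def jaccard(a: Set[Span], b: Set[Span]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two annotation sets."""
    union = a | b
    # Two empty sets are treated as being in perfect agreement (an assumption).
    return len(a & b) / len(union) if union else 1.0


def f1_score(gold: Set[Span], pred: Set[Span]) -> float:
    """Entity-level F1 under exact span-and-type matching (an assumption)."""
    tp = len(gold & pred)  # true positives: spans present in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy annotations, invented for illustration only.
ours = {
    ("doc1", 0, 5, "gene"),
    ("doc1", 10, 18, "variant"),
    ("doc1", 30, 42, "disease"),
}
public = {
    ("doc1", 0, 5, "gene"),
    ("doc1", 10, 18, "variant"),
    ("doc1", 50, 55, "species"),
}

agreement = jaccard(ours, public)   # 2 shared spans out of 4 distinct spans
score = f1_score(public, ours)      # treats the public set as the reference
```

To obtain per-type figures like those in the abstract, the same functions could be applied after filtering each set to a single entity type.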

Authors

  • Dao-Ling Huang
    BGI Research, Shenzhen, 518083, China. dlhuang1217@gmail.com.
  • Quanlei Zeng
    BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China.
  • Yun Xiong
    Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, China.
  • Shuixia Liu
    BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China.
  • Chaoqun Pang
    BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China.
  • Menglei Xia
    BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China.
  • Ting Fang
    Beijing Institute of Biotechnology, 20 Dongdajie Street, Fengtai District, Beijing, China.
  • Yanli Ma
    College of Information Science and Engineering, Hebei North University, 11 Diamond South Road, Zhangjiakou 075000, China.
  • Cuicui Qiang
    BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China.
  • Yi Zhang
    Department of Thyroid Surgery, China-Japan Union Hospital of Jilin University, Jilin University, Changchun, China.
  • Yu Zhang
    College of Marine Electrical Engineering, Dalian Maritime University, Dalian, China.
  • Hong Li
    Department of Public Health Sciences, Medical College of South Carolina, Charleston, SC.
  • Yuying Yuan
    Clinical laboratory of BGI Health, BGI-Shenzhen, Shenzhen, China.