Partial Annotation Learning for Biomedical Entity Recognition.

Journal: IEEE journal of biomedical and health informatics
PMID:

Abstract

Named Entity Recognition (NER) is a key task to support biomedical research. In Biomedical Named Entity Recognition (BioNER), obtaining high-quality expert annotated data is laborious and expensive, leading to the development of automatic approaches such as distant supervision. However, manually and automatically generated data often suffer from the unlabeled entity problem, whereby many entity annotations are missing, degrading the performance of full annotation NER models. To conquer this issue, we undertake a systematic exploration of the efficacy of partial annotation learning methods for BioNER, which encompasses a comprehensive evaluation conducted across a spectrum of distinct simulated scenarios of missing entity annotations. Furthermore, we propose a TS-PubMedBERT-Partial-CRF partial annotation learning model. We standardize a compilation of 16 BioNER corpora, encompassing a range of five distinct entity types, to establish a gold standard. And we compare against the state-of-the-art partial annotation model EER-PubMedBERT, the widely acknowledged partial annotation model BiLSTM-Partial-CRF model, and the state-of-the-art full annotation learning BioNER model PubMedBERT tagger. Results show that partial annotation learning-based methods can effectively learn from biomedical corpora with missing entity annotations. Our proposed model outperforms alternatives and, specifically, the PubMedBERT tagger by 38% in F1-score under high missing entity rates. Moreover, the recall of entity mentions in our model demonstrates a competitive alignment with the upper threshold observed on the fully annotated dataset.

Authors

  • Liangping Ding
  • Giovanni Colavizza
  • Zhixiong Zhang
    Department of Electrical Engineering and Computer Science, University of Michigan, 1301 Beal Avenue, Ann Arbor, MI 48109-2122, USA. zhangzx@umich.edu esyoon@umich.edu.