Noise Reduction Learning Based on XLNet-CRF for Biomedical Named Entity Recognition.

Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Published Date:

Abstract

In recent years, Biomedical Named Entity Recognition (BioNER) systems, which extract information from the rapidly expanding biomedical literature, have mainly been based on deep neural networks. Transformer-based autoencoding language models that capture long-distance context have recently been employed for BioNER with great success. However, noise interferes with both pre-training and fine-tuning, and these models lack an effective decoder for label dependencies, so current models leave room for improvement. We propose two noise reduction models, Shared Labels and Dynamic Splicing, which encode with XLNet, a permutation language pre-training model, and decode with a Conditional Random Field (CRF). Tested on 15 biomedical named entity recognition datasets, the two models improved the average F1-score by 1.504 and 1.48, respectively, and achieved state-of-the-art performance on 7 of the datasets. Further analysis demonstrates the effectiveness of the two models and the improvement in recognition that the CRF provides, and suggests the applicable scope of each model according to different data characteristics.
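
The role of a CRF decoder in modeling label dependency can be illustrated with a minimal sketch of Viterbi decoding for a generic linear-chain CRF. This is not the authors' implementation: the function name, the label set, and all scores below are hypothetical, and in a real system the emission scores would come from the encoder (XLNet in the paper) while the transition scores would be learned.

```python
from typing import Dict, List


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[str, Dict[str, float]],
                   labels: List[str]) -> List[str]:
    """Return the highest-scoring label sequence for a linear-chain CRF.

    emissions[t][y]   : score of label y at token t (hypothetical encoder output)
    transitions[p][y] : score of moving from label p to label y
    """
    if not emissions:
        return []
    # best[t][y] is the best score of any path that ends in label y at token t.
    best = [{y: emissions[0][y] for y in labels}]
    back = []  # back[t-1][y] is the best previous label for y at token t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy example with made-up scores: an "I" tag directly after "O" is penalized.
labels = ["O", "B", "I"]
transitions = {p: {y: 0.0 for y in labels} for p in labels}
transitions["O"]["I"] = -10.0
emissions = [
    {"O": 1.0, "B": 0.8, "I": 0.0},
    {"O": 0.0, "B": 0.1, "I": 1.0},
]
print(viterbi_decode(emissions, transitions, labels))
```

In this toy case, picking the best label for each token independently would yield the invalid sequence O, I, whereas the transition penalty steers the decoder to B, I. That is the kind of label dependency a per-token classifier cannot express.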

Authors

  • Zhaoying Chai
  • Han Jin
    School of Big Data Application and Economics, Guizhou University of Finance and Economics, Guiyang, Guizhou, China.
  • Shenghui Shi
    Chongqing Key Laboratory of Optical Fiber Sensor and Photoelectric Detection, Chongqing University of Technology, Chongqing, 400054, China.
  • Siyan Zhan
  • Lin Zhuo
  • Yu Yang
    Department of Obstetrics & Gynecology, the First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, Shaanxi, China.
  • Qi Lian