Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification.

Journal: Database: The Journal of Biological Databases and Curation
Published Date:

Abstract

Biomedical publications are being produced in large volumes and at an ever-increasing rate. To deal with this large amount of unstructured text, effective natural language processing (NLP) methods need to be developed for tasks such as document classification and information extraction. The BioCreative Challenge was established to evaluate the effectiveness of information extraction methods in the biomedical domain and to facilitate their development as a community-wide effort. In this paper, we summarize our work and what we learned from the latest round, BioCreative Challenge VII, in which we participated in all five tracks. Overall, we found three key components for achieving high performance across a variety of NLP tasks: (1) pre-trained NLP models, (2) data augmentation strategies and (3) ensemble modelling. These three strategies need to be tailored to the specific task at hand to produce high-performing baseline models, which are usually good enough for practical applications. When they are further combined with task-specific methods, additional improvements (usually rather small) can be achieved, which might be critical for winning competitions. Database URL: https://doi.org/10.1093/database/baac066.
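
As an illustration of the third component, ensemble modelling, the following is a minimal sketch of majority voting over the label predictions of several document classifiers. The abstract does not state which ensembling scheme was used in any track, so the voting rule, the function name `majority_vote` and the example labels are all assumptions made for illustration only.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label predictions from several models by majority vote.

    `predictions` is a list with one entry per model; each entry is a list
    of predicted labels, one per document, in the same document order.
    This is a hypothetical helper, not the method used in the paper.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_docs = len(predictions[0])
    if any(len(model_preds) != n_docs for model_preds in predictions):
        raise ValueError("every model must predict the same number of documents")

    combined = []
    # zip(*predictions) groups the labels that all models gave to one document.
    for labels in zip(*predictions):
        most_common_label, _count = Counter(labels).most_common(1)[0]
        combined.append(most_common_label)
    return combined


# Hypothetical predictions from three models on three documents.
model_a = ["relevant", "irrelevant", "relevant"]
model_b = ["relevant", "relevant", "irrelevant"]
model_c = ["irrelevant", "relevant", "relevant"]

ensemble_labels = majority_vote([model_a, model_b, model_c])
print(ensemble_labels)
```

In this toy case each individual model disagrees with the other two on one document, and the vote overrides that single dissenting prediction. Averaging predicted probabilities instead of voting on hard labels is a common alternative when the models expose scores.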

Authors

  • Arslan Erdengasileng
    Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.
  • Qing Han
    Engineering College, Honghe University, Honghe, Yunnan, China.
  • Tingting Zhao
    School of Software Engineering, Beihang University, Beijing, China.
  • Shubo Tian
    Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.
  • Xin Sui
    Department of Radiology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences, No. 1 Shuaifuyuan Wangfujing Dongcheng District, Beijing, 100730, China.
  • Keqiao Li
    Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.
  • Wanjing Wang
    Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.
  • Jian Wang
    Veterinary Diagnostic Center, Shanghai Animal Disease Control Center, Shanghai, China.
  • Ting Hu
    Memorial University of Newfoundland, St. John's, Canada.
  • Feng Pan
    Department of Radiation Oncology, China-Japan Union Hospital of Jilin University, Changchun, China.
  • Yuan Zhang
    Department of Urology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China.
  • Jinfeng Zhang
    Department of Statistics, Florida State University, Tallahassee, FL 32306, USA. jinfeng@stat.fsu.edu.