Traditional Chinese medicine clinical records classification with BERT and domain specific corpora.

Journal: Journal of the American Medical Informatics Association : JAMIA
Published Date:

Abstract

Traditional Chinese Medicine (TCM) has been developed for several thousand years and plays a significant role in health care for Chinese people. This paper studies the problem of classifying TCM clinical records into 5 main disease categories in TCM. We explored a number of state-of-the-art deep learning models and found that the recent Bidirectional Encoder Representations from Transformers can achieve better results than other deep learning models and other state-of-the-art methods. We further utilized an unlabeled clinical corpus to fine-tune the BERT language model before training the text classifier. The method only uses Chinese characters in clinical text as input without preprocessing or feature engineering. We evaluated deep learning models and traditional text classifiers on a benchmark data set. Our method achieves a state-of-the-art accuracy 89.39% ± 0.35%, Macro F1 score 88.64% ± 0.40% and Micro F1 score 89.39% ± 0.35%. We also visualized attention weights in our method, which can reveal indicative characters in clinical text.

Authors

  • Liang Yao
    Northwestern University, Chicago, IL.
  • Zhe Jin
    Zhejiang University, College of Computer Science and Technology, Hangzhou, China.
  • Chengsheng Mao
    Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, 60611, USA.
  • Yin Zhang
    Department of Radiation Oncology, Rutgers-Cancer Institute of New Jersey, Rutgers-Robert Wood Johnson Medical School, New Brunswick, NJ, United States.
  • Yuan Luo
    Department of Preventive Medicine, Northwestern University, Feinberg School of Medicine, Chicago, IL 60611, USA.