De-identification of Clinical Text via Bi-LSTM-CRF with Neural Language Models.

Journal: AMIA ... Annual Symposium proceedings. AMIA Symposium
Published Date:

Abstract

De-identification of clinical text, a prerequisite for reusing electronic clinical data, is a typical named entity recognition (NER) problem. A number of state-of-the-art deep learning methods for NER, such as Bi-LSTM-CRF (bidirectional long short-term memory with conditional random fields), have been applied to de-identification. Neural language models used for language representation bring large improvements in many NLP tasks when they are integrated with other deep learning methods. In this paper, we introduce Bi-LSTM-CRF with neural language models for de-identification of clinical text, and evaluate it on the de-identification datasets of the i2b2 2014 and CEGS N-GRID 2016 challenges. Four neural language models of three types, each individually integrated with Bi-LSTM-CRF, are compared in this study. Bi-LSTM-CRF with neural language models achieves the highest "strict" micro-averaged F1-score of 95.50% on the i2b2 2014 dataset and 91.82% on the CEGS N-GRID 2016 dataset, setting new benchmark results on both datasets.

Keywords: De-identification, Named entity recognition, Bidirectional long-short-term-memory, Conditional random fields, Neural language models.
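
As a reading aid for the reported metric, the following is a minimal sketch of how a "strict" micro-averaged F1-score can be computed. It assumes that "strict" means a predicted entity counts as correct only when its offsets and its category both match a gold entity exactly, and that micro-averaging pools counts over all documents and categories. The function name `strict_micro_f1`, the span layout, and the toy data are hypothetical and are not taken from the paper or from the challenge evaluation scripts.

```python
from typing import Dict, Set, Tuple

# Hypothetical span layout: (document id, start offset, end offset, PHI category).
Span = Tuple[str, int, int, str]


def strict_micro_f1(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Micro-averaged precision, recall and F1 under exact-match scoring.

    Assumption: a prediction is a true positive only if document, start,
    end and category are all identical to a gold span.
    """
    tp = len(gold & pred)  # exact matches pooled over all documents and categories
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data, invented for illustration only.
gold = {
    ("doc1", 0, 10, "NAME"),
    ("doc1", 25, 35, "DATE"),
    ("doc2", 5, 12, "HOSPITAL"),
    ("doc2", 40, 48, "DATE"),
}
pred = {
    ("doc1", 0, 10, "NAME"),      # exact match
    ("doc1", 25, 34, "DATE"),     # end offset differs by one: not a strict match
    ("doc2", 5, 12, "HOSPITAL"),  # exact match
    ("doc2", 40, 48, "DATE"),     # exact match
}

scores = strict_micro_f1(gold, pred)
print(scores)
```

In this toy case three of the four predictions match exactly, so precision, recall and F1 are all 0.75. The near-miss on the second span earns no partial credit, which is what distinguishes strict scoring from more relaxed, overlap-based variants.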

Authors

  • Buzhou Tang
  • Dehuan Jiang
    Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen, China.
  • Qingcai Chen
    Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China.
  • Xiaolong Wang
    Cardiovascular Department, Shuguang Hospital Affiliated to Shanghai University of TCM, Shanghai, China.
  • Jun Yan
    Department of Statistics, University of Connecticut, Storrs, CT 06269, USA.
  • Ying Shen