CRFs based de-identification of medical records.

Journal: Journal of biomedical informatics
Published Date:

Abstract

De-identification, a shared task of the 2014 i2b2/UTHealth challenge, aims to remove protected health information (PHI) from medical records. In this paper, we propose a novel de-identifier, WI-deId, based on conditional random fields (CRFs). We introduce a preprocessing module that tokenizes the medical records using regular expressions and an off-the-shelf tokenizer, and we extract three groups of features to train the de-identification model. Experiments show that the system is effective at de-identifying medical records, achieving a micro-F1 of 0.9232 at the i2b2 strict entity evaluation level.
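
As a rough illustration of the reported metric, the sketch below computes a micro-averaged F1 under strict entity matching, where a predicted PHI span counts as correct only if its record, offsets, and type all match a gold span exactly. This is not the official i2b2 evaluation script: the function name `strict_micro_f1`, the tuple layout, and the sample annotations are all assumptions made for illustration.

```python
from typing import Iterable, Tuple

# Hypothetical entity layout: (record id, start offset, end offset, PHI type).
Entity = Tuple[str, int, int, str]


def strict_micro_f1(gold: Iterable[Entity], predicted: Iterable[Entity]) -> float:
    """Micro-averaged F1 where only exact span-and-type matches count."""
    gold_set, pred_set = set(gold), set(predicted)
    # Strict matching: an entity is a true positive only if every field agrees.
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Illustrative (made-up) annotations; the labels are placeholders.
gold = [
    ("rec1", 0, 8, "NAME"),
    ("rec1", 20, 30, "DATE"),
    ("rec2", 5, 12, "HOSPITAL"),
]
pred = [
    ("rec1", 0, 8, "NAME"),
    ("rec1", 20, 29, "DATE"),  # end offset is off by one, so this is a strict miss
    ("rec2", 5, 12, "HOSPITAL"),
]

score = strict_micro_f1(gold, pred)
```

Because counts are pooled over all entities before precision and recall are computed, frequent PHI types weigh more heavily in the micro average than rare ones.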

Authors

  • Bin He
    Clinical Translational Medical Center, The Affiliated Dongguan Songshan Lake Central Hospital, Guangdong Medical University, Dongguan, Guangdong, China.
  • Yi Guan
    School of Computer Science and Technology, Harbin Institute of Technology, Integrated Laboratory Building 803, Harbin 150001, China. Electronic address: guanyi@hit.edu.cn.
  • Jianyi Cheng
    School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
  • Keting Cen
    School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
  • Wenlan Hua
    School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.