Automatic de-identification of electronic medical records using token-level and character-level conditional random fields.

Journal: Journal of biomedical informatics
Published Date:

Abstract

De-identification, the task of identifying and removing all protected health information (PHI) from clinical data such as electronic medical records (EMRs), is a critical step in making clinical data publicly available. The 2014 i2b2 (Informatics for Integrating Biology and the Bedside) clinical natural language processing (NLP) challenge included a track for de-identification (track 1). In this study, we propose a hybrid system for the de-identification track that combines machine learning and rule-based approaches. In our system, PHI instances are first identified by two conditional random fields (CRFs), one token-level and one character-level, and by a rule-based classifier; the resulting predictions are then merged by a set of rules. Experiments conducted on the i2b2 corpus show that the system we submitted to the challenge achieves highest micro F-scores of 94.64%, 91.24% and 91.63% under the "token", "strict" and "relaxed" criteria respectively, placing it among the top-ranked systems of the 2014 i2b2 challenge. After integrating some refined localization dictionaries, the system improves further, to F-scores of 94.83%, 91.57% and 91.95% under the "token", "strict" and "relaxed" criteria respectively.
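The abstract does not spell out the merging rules or the scoring procedure, so the following is only a minimal sketch of how predictions from three taggers might be merged and scored. Everything in it is an assumption made for illustration: the `Span` type, the `PRIORITY` order, the "longest span wins" rule, and the reading of the "strict" criterion as an exact match on character offsets and PHI category.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # PHI category, e.g. "DATE"
    source: str  # which tagger produced the span


# Hypothetical priority, used only to break ties between equal-length spans.
PRIORITY = {"rule": 0, "token_crf": 1, "char_crf": 2}


def merge_spans(candidates: List[Span]) -> List[Span]:
    """Greedy merge: consider longer spans first, drop any span that
    overlaps one already kept. This is an illustrative rule, not the
    paper's actual merging procedure."""
    ranked = sorted(
        candidates,
        key=lambda s: (-(s.end - s.start), PRIORITY[s.source], s.start),
    )
    kept: List[Span] = []
    for span in ranked:
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


def strict_micro_f1(gold: List[Span], predicted: List[Span]) -> float:
    """Micro F1 pooled over all categories, counting a prediction as correct
    only if start, end and label all match a gold span (assumed 'strict')."""
    gold_set = {(s.start, s.end, s.label) for s in gold}
    pred_set = {(s.start, s.end, s.label) for s in predicted}
    tp = len(gold_set & pred_set)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)


text = "Seen on 2014-03-05 by Dr. Smith"
candidates = [
    Span(8, 18, "DATE", "token_crf"),   # full date from the token-level CRF
    Span(8, 12, "DATE", "char_crf"),    # partial date from the character-level CRF
    Span(8, 18, "DATE", "rule"),        # full date from a rule
    Span(26, 31, "DOCTOR", "char_crf"), # name found only by the character-level CRF
]
gold = [Span(8, 18, "DATE", "gold"), Span(26, 31, "DOCTOR", "gold")]

merged = merge_spans(candidates)
for span in merged:
    print(span.label, repr(text[span.start:span.end]), "from", span.source)
print("strict micro F1:", strict_micro_f1(gold, merged))
```

In this toy example the duplicate and partial date predictions are discarded in favor of one full-length span, and the name found by only one tagger is retained. A real system would need rules tuned on the training data, and the "token" and "relaxed" criteria would need their own matching functions.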

Authors

  • Zengjian Liu
    Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China. Electronic address: liuzengjian.hit@gmail.com.
  • Yangxin Chen
    Department of Cardiology, Sun Yat-sen Memorial Hospital of Sun Yat-sen University, Guangzhou 510120, China. Electronic address: tjcyx1995@163.com.
  • Buzhou Tang
  • Xiaolong Wang
    Cardiovascular Department, Shuguang Hospital Affiliated to Shanghai University of TCM, Shanghai, China.
  • Qingcai Chen
    Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China.
  • Haodi Li
    Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China. Electronic address: haodili.hit@gmail.com.
  • Jingfeng Wang
    Department of Cardiology, Sun Yat-sen Memorial Hospital of Sun Yat-sen University, Guangzhou 510120, China. Electronic address: dr_wjf@hotmail.com.
  • Qiwen Deng
    The Sixth People's Hospital of Shenzhen, Shenzhen 518052, China. Electronic address: qiwendeng@hotmail.com.
  • Suisong Zhu
    The Sixth People's Hospital of Shenzhen, Shenzhen 518052, China. Electronic address: 13809883596@163.com.