De-identification of clinical notes with pseudo-labeling using regular expression rules and pre-trained BERT.

Journal: BMC medical informatics and decision making
PMID:

Abstract

BACKGROUND: De-identification of clinical notes is essential for using the rich information in unstructured text data in medical research. However, little work has been done on removing personal information from clinical notes in Korea.

Authors

  • Jiyong An
    Graduate School of Data Science, Seoul National University, Seoul, South Korea.
  • Jiyun Kim
    Department of Materials Science and Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, South Korea. jiyunkim@unist.ac.kr.
  • Leonard Sunwoo
    Department of Radiology, Seoul National University Bundang Hospital, 82, Gumi-ro 173 Beon-gil, Bundang-gu, Seongnam-si, Gyeonggi-do 13620, Republic of Korea.
  • Hyunyoung Baek
    Healthcare ICT Research Center, Office of eHealth Research and Businesses, Seoul National University Bundang Hospital, Seongnam, South Korea.
  • Sooyoung Yoo
    Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam, Republic of Korea.
  • Seunggeun Lee
    Graduate School of Data Science, Seoul National University, Seoul, South Korea. lee7801@snu.ac.kr.