Unlocking the Secrets Behind Advanced Artificial Intelligence Language Models in Deidentifying Chinese-English Mixed Clinical Text: Development and Validation Study.

Journal: Journal of Medical Internet Research
Published Date:

Abstract

BACKGROUND: The widespread use of electronic health records in the clinical and biomedical fields makes the removal of protected health information (PHI) essential to maintain privacy. However, a significant portion of this information is recorded as unstructured text, which poses a challenge for deidentification. In multilingual countries, medical records may be written in a mixture of two or more languages, a practice referred to as code mixing. Most current clinical natural language processing techniques are designed for monolingual text, leaving a need to address the deidentification of code-mixed text.

Authors

  • You-Qian Lee
    Dialogue System Technical Department, Intelligent Robot, Asustek Computer Inc, Taipei, Taiwan.
  • Ching-Tai Chen
    Institute of Information Science, Academia Sinica, 115, Taipei City, Taiwan.
  • Chien-Chang Chen
    Bio-Microsystems Integration Laboratory, Department of Biomedical Sciences and Engineering, National Central University, Taoyuan City, Taiwan.
  • Chung-Hong Lee
    Knowledge Discovery and Data Mining Lab, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan.
  • Peitsz Chen
    Department of Chemical Engineering, Feng Chia University, Taichung, Taiwan.
  • Chi-Shin Wu
    Department of Psychiatry, National Taiwan University Hospital, Taipei, Taiwan R.O.C.
  • Hong-Jie Dai
    Department of Computer Science and Information Engineering, National Taitung University, Taiwan. Electronic address: hjdai@nttu.edu.tw.