Identifying protected health information by transformers-based deep learning approach in Chinese medical text.
Journal: Health informatics journal
PMID: 39862116
Abstract
This paper proposes a deep learning algorithm based on Bidirectional Encoder Representations from Transformers (BERT) to identify privacy information in Chinese clinical texts, and evaluates the feasibility of the method for privacy protection in the Chinese clinical context. We collected and double-annotated 33,017 discharge summaries from 151 medical institutions on a municipal regional health information platform, developed a model that combines BERT with a Bidirectional Long Short-Term Memory (BiLSTM) network and a Conditional Random Field (CRF) layer, and tested its privacy identification performance on this dataset. To explore the contribution of different substructures of the neural network, we built five additional baseline models and compared their performance. On the annotated data, adding a BERT model pre-trained on a medical corpus significantly improved the performance of the BiLSTM-CRF model, reaching a micro-recall of 0.979 and an F1 value of 0.976, which indicates promising performance in identifying private information in Chinese clinical texts. The BERT-based BiLSTM-CRF model excels at identifying privacy information in Chinese clinical texts, and its application is effective in protecting patient privacy and facilitating data sharing.
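The abstract reports a micro-recall and an F1 value but does not describe the scoring procedure. As an illustration only, the sketch below shows one common way such figures are computed for sequence labeling: entity spans are extracted from BIO tag sequences, and true positives, false positives, and false negatives are pooled over all documents before precision, recall, and F1 are calculated. The BIO scheme, the exact-match criterion, the label names (NAME, ID, DATE), and the function names are all assumptions for this example, not details taken from the paper.

```python
def extract_spans(tags):
    """Convert a BIO tag sequence into a set of (label, start, end) spans.

    `end` is exclusive. A stray "I-X" with no matching open span is
    treated leniently as the start of a new span (an assumption of this sketch).
    """
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label is not None and tag[2:] == label
        if not continues:
            if label is not None:
                spans.add((label, start, i))
            if tag.startswith("B-") or tag.startswith("I-"):
                start, label = i, tag[2:]
            else:
                start, label = None, None
    return spans


def micro_scores(gold_docs, pred_docs):
    """Micro-averaged precision, recall, and F1 with exact span matching.

    Counts are pooled across all documents and labels before the ratios
    are taken, which is what "micro" averaging means.
    """
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold_docs, pred_docs):
        gold = extract_spans(gold_tags)
        pred = extract_spans(pred_tags)
        tp += len(gold & pred)   # spans with identical label and boundaries
        fp += len(pred - gold)   # predicted spans not in the gold annotation
        fn += len(gold - pred)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy data: two short documents with made-up labels.
gold_docs = [
    ["B-NAME", "I-NAME", "O", "B-ID", "I-ID", "I-ID"],
    ["O", "B-DATE", "O"],
]
pred_docs = [
    ["B-NAME", "I-NAME", "O", "B-ID", "I-ID", "O"],  # ID span cut one token short
    ["O", "B-DATE", "O"],
]
precision, recall, f1 = micro_scores(gold_docs, pred_docs)
```

In this toy example the truncated ID span counts as both a false positive and a false negative, so precision, recall, and F1 all equal 2/3. Exact matching is strict by design; a de-identification evaluation might also report a relaxed, token-level score, which this sketch does not cover.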