A flexible two-stage anonymization framework for narrative medical records adapting to various language models.
Journal:
Computers in Biology and Medicine
Published Date:
Jun 23, 2025
Abstract
The healthcare sector increasingly relies on Electronic Health Records (EHRs), which support efficient, high-quality patient care by providing rapid access to comprehensive medical information. However, these records contain sensitive patient data that must be protected, especially when transferred to cloud environments. Identifying and anonymizing this sensitive information is challenging because, in unstructured narrative text, it is dispersed across multiple words or phrases. To systematically detect and anonymize sensitive content in unstructured narrative medical records, this study proposes a two-stage k-anonymization framework that combines natural language processing (NLP) methods with privacy-preserving techniques. The first stage extracts sensitive entities from narrative medical records according to identifiers predefined by existing privacy rules, and the second stage generates perturbed data that satisfies k-anonymity. Fine-tuned Bidirectional Encoder Representations from Transformers (BERT) models and prompt-driven Large Language Models (LLMs) were developed and customized for this framework. Experimental results demonstrate that the framework achieves F1-scores of over 90% across multiple entity types. The two-stage structure also allows entity categories and anonymization strategies to be adjusted dynamically to comply with various privacy regulations. Because many healthcare environments have minimal computational resources, the framework was optimized for deployment on standard consumer-grade computers with widely available GPUs: Low-Rank Adaptation (LoRA) is used instead of full fine-tuning to reduce memory consumption, making the framework suitable for both large-scale and resource-constrained environments.
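The two-stage structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several assumptions. Stage 1 uses two hypothetical regular expressions (`AGE`, `ZIP`) as a stand-in for the fine-tuned BERT models and prompted LLMs. Stage 2 uses simple generalization (age ranges, masked ZIP digits) as one possible way to reach k-anonymity, whereas the abstract only says that perturbed data is generated. All function names and the example notes are invented for illustration.

```python
import re
from collections import Counter

# Stage 1 stand-in (assumption): regex patterns replace the paper's
# fine-tuned BERT / prompted LLM entity extractors.
PATTERNS = {
    "AGE": re.compile(r"\b(\d{1,3})-year-old\b"),
    "ZIP": re.compile(r"\b(\d{5})\b"),
}

def extract_entities(note):
    """Return {entity_type: value} using the first match of each pattern."""
    found = {}
    for label, pattern in PATTERNS.items():
        match = pattern.search(note)
        if match:
            found[label] = match.group(1)
    return found

def generalize(entities, level):
    """Coarsen the quasi-identifiers; a higher level gives less precise values.

    Assumes both AGE and ZIP were found in the note.
    """
    age = int(entities["AGE"])
    width = 10 * level                      # age bucket width: 10, 20, 30, ...
    low = (age // width) * width
    keep = max(0, 5 - level)                # mask one more trailing digit per level
    zip_masked = entities["ZIP"][:keep] + "*" * (5 - keep)
    return (f"{low}-{low + width - 1}", zip_masked)

def is_k_anonymous(rows, k):
    """Every combination of quasi-identifier values must occur at least k times."""
    return all(count >= k for count in Counter(rows).values())

def anonymize(notes, k, max_level=5):
    """Stage 2: raise the generalization level until the rows are k-anonymous."""
    extracted = [extract_entities(note) for note in notes]
    for level in range(1, max_level + 1):
        rows = [generalize(entities, level) for entities in extracted]
        if is_k_anonymous(rows, k):
            return level, rows
    raise ValueError("k-anonymity not reached within max_level")

# Invented example notes, not data from the study.
notes = [
    "A 34-year-old man from 90210 presented with chest pain.",
    "A 37-year-old woman from 90213 reported shortness of breath.",
    "A 52-year-old man from 90455 was admitted with fever.",
    "A 58-year-old woman from 90471 returned for follow-up.",
]
level, rows = anonymize(notes, k=2)
```

Because extraction and anonymization are separate functions, either can be swapped independently: adding an entity type only touches `PATTERNS`, while a different privacy rule only changes `generalize` or `k`. This mirrors the adjustability the abstract attributes to the two-stage design.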