Large Language Model Empowered Privacy-Protected Framework for PHI Annotation in Clinical Notes
Journal:
arXiv
Published Date:
Apr 22, 2025
Abstract
The de-identification of private information in medical data is a crucial
process to mitigate the risk of confidentiality breaches, particularly when
patient personal details are not adequately removed before the release of
medical records. Although rule-based and learning-based methods have been
proposed, they often struggle with limited generalizability and require
substantial amounts of annotated data for effective performance. Recent
advancements in large language models (LLMs) have shown significant promise in
addressing these issues due to their superior language comprehension
capabilities. However, LLMs present challenges, including potential privacy
risks when using commercial LLM APIs and high computational costs for deploying
open-source LLMs locally. In this work, we introduce LPPA, an LLM-empowered
Privacy-Protected PHI Annotation framework for clinical notes, targeting the
English language. By fine-tuning LLMs locally with synthetic notes, LPPA
ensures strong privacy protection and high PHI annotation accuracy. Extensive
experiments demonstrate LPPA's effectiveness in accurately de-identifying
private information, offering a scalable and efficient solution for enhancing
patient privacy protection.