LLMs-in-the-Loop Part 2: Expert Small AI Models for Anonymization and De-identification of PHI Across Multiple Languages
Journal: arXiv
Published Date: Dec 14, 2024
Abstract
The rise of chronic diseases and pandemics like COVID-19 has emphasized the
need for effective patient data processing while ensuring privacy through
anonymization and de-identification of protected health information (PHI).
Anonymized data facilitates research without compromising patient
confidentiality. This paper introduces expert small AI models developed using
the LLM-in-the-loop methodology to meet the demand for domain-specific
de-identification NER models. These models overcome the privacy risks
associated with large language models (LLMs) used via APIs by eliminating the
need to transmit or store sensitive data. More importantly, they consistently
outperform LLMs in de-identification tasks while offering greater
reliability. Our de-identification NER models, developed in eight languages
(English, German, Italian, French, Romanian, Turkish, Spanish, and Arabic),
achieved average micro-F1 scores of 0.966, 0.975, 0.976, 0.970, 0.964, 0.974,
0.978, and 0.953, respectively. These results establish them as the most
accurate healthcare anonymization solutions, surpassing existing small models
and even general-purpose LLMs such as GPT-4o. While Part 1 of this series
introduced the LLM-in-the-loop methodology for biomedical document
translation, this second paper showcases its success in developing
cost-effective expert small NER models for de-identification tasks. Our findings
lay the groundwork for future healthcare AI innovations, including biomedical
entity and relation extraction, demonstrating the value of specialized models
for domain-specific challenges.
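
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a micro-averaged F1 score can be computed for entity-level NER output. It is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not say whether scoring is done per token or per entity span, so exact-match span scoring is assumed here, and the function name `micro_f1`, the `Span` type, and the labels in the usage note are hypothetical.

```python
from typing import List, Tuple

# A predicted or gold entity: (start offset, end offset, label).
# This representation is an assumption made for illustration.
Span = Tuple[int, int, str]


def micro_f1(gold_docs: List[List[Span]], pred_docs: List[List[Span]]) -> float:
    """Micro-averaged F1 with exact span-and-label matching.

    Counts of true positives, false positives, and false negatives are
    pooled over all documents and labels before precision and recall
    are computed, which is what distinguishes micro from macro averaging.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)   # predicted and correct
        fp += len(pred_set - gold_set)   # predicted but not in gold
        fn += len(gold_set - pred_set)   # in gold but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

As a usage example, with gold spans `[[(0, 4, "NAME"), (10, 20, "DATE")], [(5, 9, "ID")]]` and predictions `[[(0, 4, "NAME")], [(5, 9, "ID"), (12, 15, "AGE")]]`, the pooled counts are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3 and the micro-F1 is 2/3.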