Automatic de-identification of French electronic health records: a cost-effective approach exploiting distant supervision and deep learning models.

Journal: BMC medical informatics and decision making
PMID:

Abstract

BACKGROUND: Electronic health records (EHRs) contain valuable information for clinical research; however, the sensitive nature of healthcare data presents security and confidentiality challenges. De-identification is therefore essential to protect personal data in EHRs and comply with government regulations. Named entity recognition (NER) methods have been proposed to remove personal identifiers, with deep learning-based models achieving better performance. However, manual annotation of training data is time-consuming and expensive. The aim of this study was to develop an automatic de-identification pipeline for all kinds of clinical documents, based on a distantly supervised method, in order to significantly reduce the cost of manual annotation and to facilitate the transfer of the pipeline to other clinical centers.

Authors

  • Mohamed El Azzouzi
    Univ Rennes, INSERM, LTSI-UMR 1099, F-35000, Rennes, France. mohamed.elazzouzi@univ-rennes.fr.
  • Gouenou Coatrieux
    IMT Atlantique, INSERM, LATIM - UMR 1101, Brest, F-29238, France.
  • Reda Bellafqira
    IMT Atlantique, INSERM, LATIM - UMR 1101, Brest, F-29238, France.
  • Denis Delamarre
    CHU Rennes, Centre de Données Cliniques, Rennes, F-35000, France.
  • Christine Riou
    CHU Rennes, Centre de Données Cliniques, Rennes, F-35000, France.
  • Naima Oubenali
    Faculté Ingénierie et Management de la Santé, Univ. Lille, 59000, Lille, France. naimaoubenali@gmail.com.
  • Sandie Cabon
  • Marc Cuggia
    Univ Rennes, CHU Rennes, Inserm, LTSI - UMR 1099, F-35000 Rennes, France.
  • Guillaume Bouzille
    Univ Rennes, CHU Rennes, Inserm, LTSI - UMR 1099, F-35000 Rennes, France.