A textual dataset of de-identified health records in Spanish and Catalan for medical entity recognition and anonymization.

Journal: Scientific Data
Published Date:

Abstract

The advancement of clinical natural language processing systems is crucial to exploit the wealth of textual data contained in medical records. Diverse data sources are required in different languages and from different sites to represent global health services. To this end, we have released CARMEN-I, a corpus of anonymized clinical records from the Hospital Clinic of Barcelona written over a two-year period during the COVID-19 pandemic. In addition to COVID-19 cases of adult patients, CARMEN-I features multiple comorbidities such as cardiovascular conditions, oncology treatments, post-transplant complications, and infectious diseases. This resource is publicly accessible together with detailed annotation guidelines and granular text-bound annotations generated in a collaborative effort between clinicians, linguists, and engineers to enable training and evaluation of automatic anonymization systems. Moreover, for information extraction purposes, a subset of 500 records is annotated with six relevant clinical concept classes: diseases, symptoms, procedures, medications, pathogens, and humans.

Authors

  • Salvador Lima-López
    NLP for Biomedical Information Analysis Unit, Barcelona Supercomputing Center, Barcelona, 08034, Spain.
  • Eulàlia Farré-Maduell
    NLP for Biomedical Information Analysis Unit, Barcelona Supercomputing Center, Barcelona, 08034, Spain.
  • Luis Gasco
    NLP for Biomedical Information Analysis Unit, Barcelona Supercomputing Center, Barcelona, 08034, Spain.
  • Jan Rodríguez-Miret
    NLP for Biomedical Information Analysis Unit, Barcelona Supercomputing Center, Barcelona, 08034, Spain.
  • Santiago Frid
    Clinical Informatics Service, Hospital Clínic de Barcelona. 08036 - Barcelona, Spain.
  • Xavier Pastor
    Department of Medical Informatics, Hospital Clinic of Barcelona-University of Barcelona, Barcelona, Spain.
  • Xavier Borrat
    Clinical Informatics, Hospital Clinic, Barcelona, 08036, Spain. xborrat@clinic.cat.
  • Martin Krallinger
    Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, Madrid, Spain.