Contextualized race and ethnicity annotations for clinical text from MIMIC-III.

Journal: Scientific data
PMID:

Abstract

Observational health research often relies on accurate and complete race and ethnicity (RE) patient information for tasks such as characterizing cohorts, assessing quality and performance metrics of hospitals and health systems, and identifying health disparities. Although the electronic health record contains structured, readily accessible patient-level RE data, these data are often missing, inaccurate, or lacking in granular detail. Natural language processing models can be trained to identify RE in clinical text, which can supplement missing RE data in clinical data repositories. Here we describe the Contextualized Race and Ethnicity Annotations for Clinical Text (C-REACT) Dataset, which comprises 12,000 patients and 17,281 sentences from their clinical notes in the MIMIC-III dataset. For these sentences, two sets of reference standard annotations for RE data are made available, along with annotation guidelines. The first set of annotations comprises highly granular information related to RE, such as preferred language and country of origin, while the second set contains RE labels annotated by physicians. This dataset can support health systems' ability to use RE data to serve health equity goals.

Authors

  • Oliver J Bear Don't Walk
    University of Washington, Seattle, Washington, USA. obdw4@uw.edu.
  • Adrienne Pichon
    Columbia University, New York, New York.
  • Harry Reyes Nieva
    Columbia University Irving Medical Center, New York, New York, USA.
  • Tony Sun
    Department of Biomedical Informatics, Columbia University, New York, New York, USA.
  • Jaan Li
    One Fact Foundation, Claymont, Delaware, USA.
  • Josh Joseph
    Harvard Medical School, Boston, Massachusetts, USA.
  • Sivan Kinberg
    Columbia University Irving Medical Center, New York, New York, USA.
  • Lauren R Richter
    Columbia University Irving Medical Center, New York, New York, USA.
  • Salvatore Crusco
    Columbia University Irving Medical Center, New York, New York, USA.
  • Kyle Kulas
    Columbia University Irving Medical Center, New York, New York, USA.
  • Shaan A Ahmed
    Columbia University Irving Medical Center, New York, New York, USA.
  • Daniel Snyder
    Columbia University Irving Medical Center, New York, New York, USA.
  • Ashkon Rahbari
    Columbia University Irving Medical Center, New York, New York, USA.
  • Benjamin L Ranard
    Division of Pulmonary, Allergy, and Critical Care Medicine, Department of Medicine, Columbia University Vagelos College of Physicians and Surgeons and NewYork-Presbyterian Hospital, New York, NY, USA; Program for Hospital and Intensive Care Informatics, Columbia University Vagelos College of Physicians and Surgeons, New York, NY, USA. Electronic address: blr2152@cumc.columbia.edu.
  • Pallavi Juneja
    Columbia University Irving Medical Center, New York, New York, USA.
  • Dina Demner-Fushman
    Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD.
  • Noémie Elhadad
    Biomedical Informatics, Columbia University, New York, NY, USA.