A Frequency-based Strategy of Obtaining Sentences from Clinical Data Repository for Crowdsourcing.

Journal: Studies in health technology and informatics
Published Date:

Abstract

In clinical NLP, a major barrier to adopting crowdsourcing for annotation is the confidentiality of protected health information (PHI) in clinical narratives. In this paper, we investigate a frequency-based approach to extracting sentences without PHI. The approach rests on the assumption that sentences that appear frequently tend to contain no PHI. Both manual and automatic evaluations of 500 sentences drawn from the 7.9 million sentences with a frequency greater than one found no PHI. These promising results suggest that such sentences could be released to obtain sentence-level NLP annotations via crowdsourcing.
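The frequency filter described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the whitespace and case normalization, and the sample sentences are all assumptions made for illustration. Only the threshold (a frequency greater than one) comes from the abstract.

```python
from collections import Counter
import re


def normalize(sentence: str) -> str:
    """Collapse whitespace and lowercase (an assumed normalization step)."""
    return re.sub(r"\s+", " ", sentence).strip().lower()


def frequent_sentences(sentences, min_count=2):
    """Return normalized sentences that occur at least `min_count` times.

    min_count=2 mirrors the abstract's "frequencies higher than one".
    """
    counts = Counter(normalize(s) for s in sentences)
    counts.pop("", None)  # ignore empty lines
    return {s: n for s, n in counts.items() if n >= min_count}


# Hypothetical sentences, invented for this example only.
notes = [
    "No known drug allergies.",
    "Patient seen with John Smith on 03/02/2014.",
    "no known drug allergies.",
    "Follow up as needed.",
    "Follow  up as needed.",
]

candidates = frequent_sentences(notes)
for sentence, count in candidates.items():
    print(count, sentence)
```

In this toy input, the sentence carrying a name and a date occurs once and is dropped, while the two templated sentences are kept. A frequency filter of this kind only selects candidates; the abstract reports that the selected sentences were still checked for PHI by manual and automatic evaluation.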

Authors

  • Dingcheng Li
    These authors contributed equally to this study; Dr. Li is now working at IBM. Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.
  • Majid Rastegar Mojarad
    Mayo Clinic, Rochester, MN, USA.
  • Yanpeng Li
    Mayo Clinic, Rochester, MN, USA.
  • Sunghwan Sohn
    Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, USA.
  • Saeed Mehrabi
    Secure Exchange Solution, Rockville, MD.
  • Ravikumar Komandur Elayavilli
    Mayo Clinic, Rochester, MN, USA.
  • Yue Yu
    Department of Mathematics, Lehigh University, Bethlehem, PA, USA.
  • Hongfang Liu
    Department of Artificial Intelligence & Informatics, Mayo Clinic, Rochester, MN, United States.