Ensemble-based Methods to Improve De-identification of Electronic Health Record Narratives.

Journal: AMIA ... Annual Symposium proceedings. AMIA Symposium
PMID:

Abstract

Text de-identification is an application of clinical natural language processing that offers significant efficiency and scalability advantages over manual de-identification. Various learning algorithms have therefore been applied to this task to improve performance. Instead of choosing the best individual learning algorithm, we aim to improve de-identification by constructing ensembles that yield more accurate classification. We present three ensemble methods that combine multiple de-identification models built with deep learning, shallow learning, and rule-based approaches. Each model is capable of automated de-identification without manual medical expertise. Our experimental results show that the stacked learning ensemble is more effective than the other ensemble methods, producing the highest recall, the most important metric for de-identification. The stacked ensemble achieved state-of-the-art performance on the 2014 i2b2 dataset with 97.04% precision, 94.45% recall, and a 95.73% F score.
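
As a quick consistency check on the reported figures, the short Python sketch below recomputes the F score from the stated precision and recall. It assumes the F score is the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state explicitly. The function name f_score and the beta parameter are illustrative and do not come from the paper.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta).

    beta=1.0 gives the balanced F1 measure; beta > 1 weights recall
    more heavily. Treating the reported value as F1 is an assumption.
    """
    if precision <= 0 or recall <= 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures reported in the abstract for the stacked ensemble (percent).
precision, recall = 97.04, 94.45

f1 = f_score(precision, recall)
print(round(f1, 2))  # 95.73, matching the reported F score
```

Because the abstract calls recall the most important metric for de-identification, a recall-weighted variant such as f_score(precision, recall, beta=2.0) could also be reported. That is only an illustration of the formula, not a figure from the paper.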

Authors

  • Youngjun Kim
  • Paul Heider
    Medical University of South Carolina, Charleston, South Carolina, USA.
  • Stephane Meystre
    Stephane Meystre, MD, PhD, is an Assistant Professor at the University of Utah and a Research Investigator in the IDEAS Center at the VA Salt Lake City Health Care System in Salt Lake City, UT.