Facilitating information extraction without annotated data using unsupervised and positive-unlabeled learning.

Journal: AMIA Annual Symposium Proceedings
PMID:

Abstract

Information extraction (IE), the distillation of specific information from unstructured data, is a core task in natural language processing. For rare entities (<1% prevalence), collecting the positive examples needed to train a model may require reviewing an infeasibly large sample of mostly negative ones. We combined unsupervised learning with biased positive-unlabeled (PU) learning methods to 1) facilitate positive example collection while maintaining the assumptions needed to 2) learn a binary classifier from the biased positive-unlabeled data alone. We tested the methods on a real-life use case of rare (<0.42%) entity extraction from medical malpractice documents. On a manually reviewed random sample of documents, the PU model achieved an area under the precision-recall curve of 0.283 and an F1 of 0.410, outperforming fully supervised learning (0.022 and 0.096, respectively). These results demonstrate our method's potential to reduce the manual effort required for extracting rare entities from narrative texts.
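For readers unfamiliar with biased positive-unlabeled learning, the sketch below illustrates one standard formulation (the Elkan-Noto label-frequency correction), in which a classifier trained to separate labeled from unlabeled examples is rescaled into a positive-vs-negative classifier. This is an illustrative example only, not the authors' pipeline; the use of scikit-learn, logistic regression, and the synthetic toy data are assumptions made for demonstration.

```python
# Illustrative sketch of positive-unlabeled (PU) learning via the
# Elkan-Noto correction. NOT the paper's exact method; the data,
# features, and model choice below are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_pu_classifier(X, s, random_state=0):
    """Fit a 'non-traditional' classifier g(x) ~ P(s=1 | x), where
    s = 1 marks labeled positives and s = 0 marks unlabeled examples,
    and estimate the label frequency c = P(s=1 | y=1)."""
    X_tr, X_val, s_tr, s_val = train_test_split(
        X, s, test_size=0.2, stratify=s, random_state=random_state)
    g = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)
    # c is estimated as the mean score g assigns to held-out labeled positives.
    c = g.predict_proba(X_val[s_val == 1])[:, 1].mean()
    return g, c


def predict_positive_proba(g, c, X):
    # Elkan-Noto correction: P(y=1 | x) = P(s=1 | x) / c, clipped to [0, 1].
    return np.clip(g.predict_proba(X)[:, 1] / c, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: a rare positive class (~2%), with only half of the
    # true positives labeled (the rest stay in the unlabeled pool).
    n, d = 5000, 20
    y = (rng.random(n) < 0.02).astype(int)
    X = rng.normal(size=(n, d)) + 2.0 * y[:, None]
    s = y * (rng.random(n) < 0.5).astype(int)
    g, c = fit_pu_classifier(X, s)
    p = predict_positive_proba(g, c, X)
    print(f"estimated label frequency c = {c:.2f}")
```

The key assumption behind this correction is "selected completely at random": labeled positives are an unbiased sample of all positives. The abstract's reference to *biased* PU learning indicates the authors relax or manage this assumption, which the sketch above does not capture.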

Authors

  • Zfania Tom Korach
    Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts, United States.
  • Sharmitha Yerneni
    Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA.
  • Jonathan Einbinder
    Harvard Medical School, Boston, MA.
  • Carl Kallenberg
    CRICO Risk Management Foundation, Boston, MA.
  • Li Zhou
    Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts, United States.