Agile text mining for the 2014 i2b2/UTHealth Cardiac risk factors challenge.

Journal: Journal of biomedical informatics
Published Date:

Abstract

This paper describes the use of an agile text mining platform (Linguamatics' Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors from patient records, as defined in the i2b2/UTHealth 2014 challenge. The approach uses a data-driven, rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-Score of 91.7%, one percentage point behind the top-performing system.

Authors

  • James Cormack
    Linguamatics Ltd., 324 Cambridge Science Park, Milton Road, Cambridge CB4 0WG, UK. Electronic address: james.cormack@linguamatics.com.
  • Chinmoy Nath
    Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, 750 N. Lake Shore Drive, 11th Floor, Chicago, IL 60611, USA.
  • David Milward
    Linguamatics Ltd., 324 Cambridge Science Park, Milton Road, Cambridge CB4 0WG, UK.
  • Kalpana Raja
    Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.
  • Siddhartha R Jonnalagadda
    Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, 750 N. Lake Shore Drive, 11th Floor, Chicago, IL 60611, USA.