Extracting and classifying diagnosis dates from clinical notes: A case study.

Journal: Journal of biomedical informatics
Published Date:

Abstract

Myeloproliferative neoplasms (MPNs) are chronic hematologic malignancies that may progress over long disease courses. The original date of diagnosis is an important piece of information for patient care and research, but it is not consistently documented. We describe an attempt to build a pipeline that extracts dates from clinical notes with natural language processing (NLP) tools and techniques and classifies each as a relevant diagnosis date or not. Inaccurate and incomplete date extraction and interpretation reduced the performance of the overall pipeline. Existing lightweight Python packages tended to have low specificity for identifying and interpreting partial and relative dates in clinical text. A rules-based regular expression (regex) approach achieved recall of 83.0% on dates manually annotated as diagnosis dates, and 77.4% on all annotated dates. With only 3.8% of annotated dates representing initial MPN diagnoses, additional methods of targeting candidate date instances may reduce noise and class imbalance.
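
To make the rules-based regex idea concrete, the following is a minimal Python sketch of extracting full, partial, and relative date mentions from note text and scoring recall against annotated spans. It is an illustration only: the pattern list, the labels, and the helper names extract_dates and recall are assumptions made for this example and are not the rules or code used in the study.

```python
import re

# Hypothetical month-name alternation (full names and common abbreviations).
MONTH = (r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|jun(?:e)?|"
         r"jul(?:y)?|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|"
         r"nov(?:ember)?|dec(?:ember)?)")

# Illustrative patterns, ordered from most to least specific.
# These are assumptions for the sketch, not the study's actual rules.
DATE_PATTERNS = [
    ("full_numeric", r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),        # 3/14/2015
    ("month_year_numeric", r"\b\d{1,2}/\d{4}\b"),            # 06/2019
    ("month_name_year",
     rf"\b{MONTH}\.?\s+(?:\d{{1,2}},?\s+)?\d{{4}}\b"),       # March 2015
    ("relative", r"\b\d+\s+(?:day|week|month|year)s?\s+ago\b"),  # 2 years ago
    ("year_only", r"\b(?:19|20)\d{2}\b"),                    # 2015
]


def extract_dates(text):
    """Return sorted (start, end, label, matched_text) tuples.

    Earlier patterns take priority; a later match that overlaps an
    already accepted span is skipped.
    """
    taken = []
    for label, pattern in DATE_PATTERNS:
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            overlaps = any(m.start() < e and s < m.end()
                           for s, e, _, _ in taken)
            if not overlaps:
                taken.append((m.start(), m.end(), label, m.group()))
    return sorted(taken)


def recall(predicted, gold_spans):
    """Fraction of annotated (start, end) spans matched exactly."""
    if not gold_spans:
        return 0.0
    predicted_spans = {(s, e) for s, e, _, _ in predicted}
    hits = sum(1 for span in gold_spans if span in predicted_spans)
    return hits / len(gold_spans)
```

Ordering the patterns from most to least specific and rejecting overlapping matches keeps a bare year inside "3/14/2015" from being counted as a separate date. Expressions outside the pattern list (for example "last spring") are simply missed, which is one way a rules-based extractor loses recall on partial and relative dates.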

Authors

  • Julia T Fu
    Department of Health Policy and Research, Weill Cornell Medicine, 402 E. 67th St, New York, NY 10065, United States; Division of Health Informatics, Memorial Sloan Kettering Cancer Center, 600 3rd Ave, 8th Fl, New York, NY 10016, United States. Electronic address: juf3004@alumni.weill.cornell.edu.
  • Evan Sholle
    Information Technologies & Services Department, Weill Cornell Medicine, New York, New York, United States of America.
  • Spencer Krichevsky
    Stony Brook University, Department of Biomedical Informatics, Stony Brook, NY, USA.
  • Joseph Scandura
    Department of Hematology and Oncology, Weill Cornell Medicine, 428 E 72nd St, Ste 300, New York, NY 10065, United States. Electronic address: jms2003@med.cornell.edu.
  • Thomas R Campion
    Information Technologies and Services Department, Weill Cornell Medicine, New York, NY.