Comparison of Grouping Methods for Template Extraction from VA Medical Record Text.

Journal: Studies in health technology and informatics
PMID:

Abstract

We investigate options for grouping templates for the purpose of template identification and extraction from electronic medical records. We sampled a corpus of 1000 documents originating from the Veterans Health Administration (VA) electronic medical record. We grouped documents through hashing and binning tokens (Hashed) as well as by the top 5% of tokens identified as important through the term frequency-inverse document frequency (TF-IDF) metric. We then compared the approaches on the number of groups with 3 or more documents and on the resulting longest common subsequences (LCSs) common to all documents in the group. We found that the Hashed method had a higher success rate for finding LCSs, and found longer LCSs, than the TF-IDF method. The TF-IDF approach, however, found more groups than the Hashed method and consequently more long sequences, although the average length of its LCSs was lower. In conclusion, each algorithm appears to be superior in some areas.
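
The following is a minimal sketch, not the authors' implementation, of how a TF-IDF grouping step and a group-wide LCS might be computed. It assumes documents are already tokenized into lists of strings. All function names (lcs, group_lcs, tfidf_keys, group_by_key) are hypothetical. Reading "top 5% of tokens" as the top 5% of each document's distinct tokens is an assumption, and folding a pairwise LCS across a group is a heuristic that yields a subsequence common to all documents but not necessarily the longest one. The Hashed method is not sketched because the abstract does not describe it in enough detail.

```python
import math
from collections import Counter, defaultdict
from functools import reduce


def lcs(a, b):
    """Longest common subsequence of two token lists (dynamic programming)."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of the suffixes a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table forward to recover one LCS.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def group_lcs(docs):
    """Fold pairwise LCS over a non-empty group of token lists.

    Heuristic: the result is common to every document, but it is
    not guaranteed to be the longest such subsequence.
    """
    return reduce(lcs, docs)


def tfidf_keys(docs, top_frac=0.05):
    """Return, per document, the set of its highest-scoring TF-IDF tokens.

    Assumption: "top 5%" is taken per document over its distinct tokens.
    """
    n = len(docs)
    df = Counter(tok for d in docs for tok in set(d))
    keys = []
    for d in docs:
        tf = Counter(d)
        scores = {t: (c / len(d)) * math.log(n / df[t]) for t, c in tf.items()}
        k = max(1, round(top_frac * len(scores)))
        # Sort by score descending, then token, for a deterministic result.
        top = sorted(scores, key=lambda t: (-scores[t], t))[:k]
        keys.append(frozenset(top))
    return keys


def group_by_key(docs, keys, min_size=3):
    """Group documents that share a key; keep groups with min_size or more."""
    groups = defaultdict(list)
    for d, k in zip(docs, keys):
        groups[k].append(d)
    return [g for g in groups.values() if len(g) >= min_size]
```

For example, group_by_key(docs, tfidf_keys(docs)) would produce candidate groups of 3 or more documents, and group_lcs applied to each group would give a candidate template sequence for comparison.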

Authors

  • Andrew M Redd
    VA Salt Lake City Health Care System, Salt Lake City, UT, United States; Department of Internal Medicine, University of Utah School of Medicine, Salt Lake City, UT, United States. Electronic address: andrew.redd@hsc.utah.edu.
  • Adi V Gundlapalli
    School of Medicine, University of Utah, Salt Lake City, Utah, US.
  • Guy Divita
    VA Salt Lake City Health Care System, Salt Lake City, Utah, USA.
  • Le-Thuy Tran
    VA Salt Lake City Health Care System, Salt Lake City, UT, United States; Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, UT, United States.
  • Warren B P Pettey
    VA Salt Lake City Health Care System & University of Utah, Salt Lake City, UT, USA.
  • Matthew H Samore
    VA Salt Lake City Health Care System, Salt Lake City, Utah, USA.