Clustering Similar Diagnosis Terms.

Journal: Studies in health technology and informatics
PMID:

Abstract

A large list of clinical diagnoses is explored with the goal of clustering syntactic variants. A string similarity heuristic is compared with a deep learning-based approach. Levenshtein distance (LD) applied to common words only (not tolerating deviations in acronyms and tokens with numerals), together with pair-wise substring expansions, raised F1 to 13% above the baseline (plain LD), with a maximum F1 of 0.71. In contrast, the model-based approach, trained on a German medical language model, did not perform better than the baseline and did not exceed an F1 value of 0.42.
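
The restricted string heuristic described above could be sketched roughly as follows. This is an illustrative approximation, not the authors' implementation: the token-wise comparison, the `is_strict` rule for detecting acronyms and numeral tokens, the `max_dist` threshold, and the example terms are all assumptions, and the pair-wise substring expansion step is not covered.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_strict(token: str) -> bool:
    """Assumed rule: tokens containing digits and all-uppercase tokens
    (treated as acronyms) must match exactly."""
    has_digit = any(ch.isdigit() for ch in token)
    is_acronym = token.isupper() and len(token) > 1
    return has_digit or is_acronym


def similar(term_a: str, term_b: str, max_dist: int = 1) -> bool:
    """Compare two terms token by token: strict tokens must be identical,
    common words may differ by at most max_dist edits (assumed threshold)."""
    tokens_a, tokens_b = term_a.split(), term_b.split()
    if len(tokens_a) != len(tokens_b):
        return False
    for ta, tb in zip(tokens_a, tokens_b):
        if is_strict(ta) or is_strict(tb):
            if ta != tb:
                return False
        elif levenshtein(ta.lower(), tb.lower()) > max_dist:
            return False
    return True


# A spelling variant in a common word is tolerated ...
print(similar("Diabetes mellitus Typ 2", "Diabetis mellitus Typ 2"))
# ... but a change in a numeral token or an acronym is not.
print(similar("Diabetes mellitus Typ 2", "Diabetes mellitus Typ 1"))
print(similar("COPD Exazerbation", "COLD Exazerbation"))
```

Plain LD over the whole string would give a distance of 1 for "Typ 2" versus "Typ 1", the same as for a harmless spelling variant, which is the kind of false merge the restriction to common words is meant to avoid.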

Authors

  • Stefan Schulz
    Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria.
  • Akhila Abdulnazar
    IMI, Medical University of Graz, Austria.
  • Markus Kreuzthaler
    Institute of Medical Informatics, Statistics, and Documentation, Medical University of Graz, Austria.