Clustering Similar Diagnosis Terms.

Journal: Studies in health technology and informatics

PMID: 37203513

Abstract

A large list of clinical diagnoses is explored with the goal of clustering syntactic variants. A string similarity heuristic is compared with a deep learning-based approach. Levenshtein distance (LD) applied to common words only (tolerating no deviations in acronyms or in tokens containing numerals), combined with pair-wise substring expansions, raised F1 to 13% above the baseline (plain LD), with a maximum F1 of 0.71. In contrast, the model-based approach, trained on a German medical language model, did not outperform the baseline and did not exceed an F1 of 0.42.
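The abstract does not spell out the heuristic, so the following is only a minimal sketch of the idea as described: edit distance is tolerated on common words, while acronyms and tokens containing numerals must match exactly. The names `is_strict` and `similar_terms`, the acronym test (all-uppercase tokens of length 2 or more), the token-by-token alignment, and the `max_dist` threshold are all assumptions made for illustration, not the authors' implementation. The pair-wise substring expansion step is not covered here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_strict(token: str) -> bool:
    """Assumed rule: tokens with digits or all-uppercase acronyms allow no deviation."""
    has_digit = any(ch.isdigit() for ch in token)
    looks_like_acronym = len(token) >= 2 and token.isupper()
    return has_digit or looks_like_acronym


def similar_terms(term_a: str, term_b: str, max_dist: int = 1) -> bool:
    """Compare two diagnosis strings token by token (hypothetical alignment).

    Strict tokens must be identical; common words may differ by up to
    max_dist edits. The threshold is an illustrative assumption.
    """
    tokens_a, tokens_b = term_a.split(), term_b.split()
    if len(tokens_a) != len(tokens_b):
        return False
    for ta, tb in zip(tokens_a, tokens_b):
        if is_strict(ta) or is_strict(tb):
            if ta != tb:
                return False
        elif levenshtein(ta.lower(), tb.lower()) > max_dist:
            return False
    return True
```

Under these assumptions, a spelling variant such as "Diabetis mellitus Typ 2" would be grouped with "Diabetes mellitus Typ 2", whereas "Diabetes mellitus Typ 1" would not, because the numeral differs.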

Authors

Stefan Schulz

Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria.

Akhila Abdulnazar

IMI, Medical University of Graz, Austria.

Markus Kreuzthaler

Institute of Medical Informatics, Statistics, and Documentation, Medical University of Graz, Austria.

Keywords

Cluster Analysis; Electronic Health Records; Language; Natural Language Processing; Records

