A systematic comparison of feature space effects on disease classifier performance for phenotype identification of five diseases.

Journal: Journal of Biomedical Informatics

Published Date: Aug 1, 2015

Abstract

Automated phenotype identification plays a critical role in cohort selection and bioinformatics data mining. Natural Language Processing (NLP)-informed classification techniques can robustly identify phenotypes in unstructured medical notes. In this paper, we systematically assess the effect of naive, lexically normalized, and semantic feature spaces on classifier performance for obesity, atherosclerotic cardiovascular disease (CAD), hyperlipidemia, hypertension, and diabetes. We train support vector machines (SVMs) using individual feature spaces as well as combinations of these feature spaces on two small training corpora (730 and 790 documents) and a combined training corpus (1520 documents). We assess the importance of feature spaces and training data size on SVM model performance. We show that inclusion of semantically informed features does not yield a statistically significant improvement in performance for these models. The addition of training data has weak effects of mixed statistical significance across disease classes, suggesting that larger corpora are not necessary to achieve relatively high performance with these models.
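As a rough illustration of what "individual feature spaces as well as combinations of these feature spaces" can mean in practice, the sketch below builds two toy bag-of-words representations of a note and merges them. It is not the authors' pipeline: the abstract does not specify the tokenization or normalization used, so the lowercase-and-strip-punctuation rule, the function names (`naive_features`, `normalized_features`, `combine`), and the sample note are all assumptions made for illustration. The semantic feature space and the SVM training step are omitted.

```python
import re
from collections import Counter


def naive_features(text):
    """Naive feature space (assumed): raw whitespace tokens, counted as-is."""
    return Counter(text.split())


def normalized_features(text):
    """Toy lexical normalization (assumed): lowercase, alphabetic runs only."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def combine(**spaces):
    """Merge feature spaces into one, prefixing each feature with the name
    of its space so identical strings from different spaces stay distinct."""
    combined = Counter()
    for name, feats in spaces.items():
        for key, count in feats.items():
            combined[f"{name}:{key}"] += count
    return combined


# Hypothetical note text, not taken from any corpus.
note = "Patient has Hypertension. Hypertension treated; denies diabetes."
naive = naive_features(note)
norm = normalized_features(note)
both = combine(naive=naive, norm=norm)
```

In the naive space, "Hypertension." and "Hypertension" are two separate features, while the normalized space collapses them into a single feature with count 2. The combined representation keeps both views side by side, which is the kind of input a classifier trained on a combination of feature spaces would see.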

Authors

Christopher Kotfila
Department of Information Studies, State University of New York at Albany, Albany, NY, USA.

Ozlem Uzuner
Department of Information Studies, University at Albany, SUNY, Albany, NY.

Keywords

Cardiovascular Diseases; Data Mining; Decision Support Systems, Clinical; Diabetes Mellitus; Diagnosis, Computer-Assisted; Electronic Health Records; Humans; Natural Language Processing; New York; Obesity; Pattern Recognition, Automated; Phenotype; Reproducibility of Results; Sensitivity and Specificity; Support Vector Machine

External Resources

PubMed ID: 26241355
