Hidden Markov model using Dirichlet process for de-identification.

Journal: Journal of biomedical informatics

Published Date: Dec 1, 2015

Abstract

For the 2014 i2b2/UTHealth de-identification challenge, we introduced a new non-parametric Bayesian hidden Markov model using a Dirichlet process (HMM-DP). The model is intended to reduce task-specific feature engineering and to generalize well to new data. For the challenge, we developed a variational method to learn the model and an efficient approximation algorithm for prediction. To accommodate out-of-vocabulary words, we designed a number of feature functions to model them. The results show that the model can exploit local context cues to make correct predictions without manual feature engineering, and that it performs as accurately as state-of-the-art conditional random field models in a number of categories. To incorporate long-range and cross-document context cues, we developed a skip-chain conditional random field model to align the results produced by HMM-DP, which further improved performance.
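
The abstract does not list the feature functions used for out-of-vocabulary words. As a rough illustration only, the Python sketch below shows the kind of surface features such functions commonly compute for an unseen token. The names `word_shape` and `oov_features`, and the particular features chosen, are hypothetical and are not taken from the paper.

```python
def word_shape(token: str) -> str:
    """Map a token to a coarse shape string (hypothetical helper).

    Uppercase letters become 'A', lowercase letters 'a', digits '9';
    any other character is kept as-is. Consecutive repeats of the same
    class are collapsed, so "McDonald" becomes "AaAa".
    """
    shape = []
    for ch in token:
        if ch.isupper():
            cls = "A"
        elif ch.islower():
            cls = "a"
        elif ch.isdigit():
            cls = "9"
        else:
            cls = ch
        # Collapse runs of the same character class.
        if not shape or shape[-1] != cls:
            shape.append(cls)
    return "".join(shape)


def oov_features(token: str) -> dict:
    """Return illustrative surface features for an unseen token.

    The feature set is an assumption made for this sketch; the paper's
    actual feature functions are not described in the abstract.
    """
    return {
        "shape": word_shape(token),
        "is_capitalized": token[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "is_all_digits": token.isdigit(),
        "has_hyphen": "-" in token,
        "suffix3": token[-3:].lower(),
    }


if __name__ == "__main__":
    for tok in ["McDonald", "12/05/2014", "555-1234"]:
        print(tok, oov_features(tok))
```

Features of this kind let a tagger back off from word identity to word form, so a date or phone number that never appeared in training can still resemble ones that did.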

Authors

Tao Chen

School of Automation, Northwestern Polytechnical University, Xi'an, 710072, Shaanxi, China.

Richard M Cullen

Primary Healthcare Research Unit, Memorial University of Newfoundland, Canada. Electronic address: richard.cullen@med.mun.ca.

Marshall Godwin

Primary Healthcare Research Unit, Memorial University of Newfoundland, Canada. Electronic address: godwinm@mun.ca.

Keywords

Cohort Studies; Computer Security; Computer Simulation; Confidentiality; Data Mining; Electronic Health Records; Machine Learning; Markov Chains; Models, Statistical; Narration; Natural Language Processing; Newfoundland and Labrador; Pattern Recognition, Automated; Vocabulary, Controlled

External Resources

PubMed ID: 26407642
