Optimizing Corpus Creation for Training Word Embedding in Low Resource Domains: A Case Study in Autism Spectrum Disorder (ASD).

Journal: AMIA ... Annual Symposium proceedings. AMIA Symposium

Published Date: Dec 5, 2018

Abstract

Automating the extraction of behavioral criteria indicative of Autism Spectrum Disorder (ASD) in electronic health records (EHRs) can contribute significantly to the effort to monitor the condition. Word embedding algorithms such as Word2Vec can encode semantic meanings of words in vectors and assist in automated vocabulary discovery from EHRs. However, text available for training word embeddings for ASD is minuscule compared to the billions of tokens typically used. We evaluate the importance of corpus specificity versus size and hypothesize that for specific domains small corpora can generate excellent word embeddings. We custom-built 6 ASD-themed corpora (N=4482), using ASD EHRs and abstracts from PubMed (N=39K) and PsychInfo (N=69K) and evaluated them. We were able to generate the most useful 200-dimension embeddings based on the small ASD EHR data. Because the abstract corpora have a more diverse vocabulary, the abstract-based embeddings generated fewer related terms and showed minimal improvement as the size of the corpus increased.
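To make the notion of "related terms" concrete, the following is a minimal sketch of how related terms might be retrieved from a set of trained embeddings by ranking cosine similarity to a seed word. It is not the authors' pipeline: the function names, the seed word, and the 3-dimensional toy vectors are all invented for illustration (the study used 200-dimension Word2Vec embeddings trained on real corpora).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_terms(seed, embeddings, topn=3):
    """Rank every other vocabulary term by cosine similarity to the seed term."""
    seed_vec = embeddings[seed]
    scored = [
        (term, cosine(seed_vec, vec))
        for term, vec in embeddings.items()
        if term != seed
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vectors, NOT taken from the paper; real embeddings would
# come from a model such as Word2Vec trained on the corpus of interest.
toy_embeddings = {
    "echolalia": [0.9, 0.1, 0.0],
    "repeats": [0.8, 0.2, 0.1],
    "phrases": [0.7, 0.3, 0.0],
    "appointment": [0.0, 0.1, 0.9],
}

neighbors = related_terms("echolalia", toy_embeddings, topn=2)
for term, score in neighbors:
    print(f"{term}\t{score:.3f}")
```

Under this sketch, an embedding set would count as more useful when the top-ranked neighbors of a seed term are judged relevant to the domain; how relevance was actually judged in the study is not specified in this abstract.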

Authors

Yang Gu
University of Arizona, Tucson, Arizona.

Gondy Leroy
University of Arizona, Tucson, AZ, United States.

Sydney Pettygrove
University of Arizona, Tucson, Arizona.

Maureen Kelly Galindo
University of Arizona, Tucson, Arizona.

Margaret Kurzius-Spencer
University of Arizona, Tucson, Arizona.

Keywords

Algorithms; Autism Spectrum Disorder; Electronic Health Records; Humans; Information Storage and Retrieval; Machine Learning; Natural Language Processing; Semantics; Terminology as Topic

External Resources

PubMed (PMID: 30815091)
