An unsupervised and customizable misspelling generator for mining noisy health-related text sources.
Journal:
Journal of Biomedical Informatics
Published Date:
Nov 13, 2018
Abstract
BACKGROUND: Data collection and extraction from noisy text sources such as social media typically rely on keyword-based searching/listening. However, health-related terms are often misspelled in such noisy text sources due to their complex morphology, resulting in the exclusion of relevant data for studies. In this paper, we present a customizable data-centric system that automatically generates common misspellings for complex health-related terms, which can improve the data collection process from noisy text sources.
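To make the problem concrete, the following is a minimal sketch of how exact keyword matching drops misspelled posts and how an expanded keyword list recovers them. It is not the system described in the abstract: it uses a naive, exhaustive edit-distance-one expansion as a stand-in for a data-centric generator, and the term "metformin", the example posts, and the function names are all illustrative assumptions.

```python
import string


def edit_distance_one_variants(term: str) -> set[str]:
    """Return every string one deletion, transposition, substitution,
    or insertion away from `term` (hypothetical helper, naive baseline)."""
    letters = string.ascii_lowercase
    splits = [(term[:i], term[i:]) for i in range(len(term) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    # Drop the correctly spelled term itself from the variant set.
    return (deletes | transposes | substitutions | inserts) - {term}


def matches_any(post: str, keywords: set[str]) -> bool:
    """Return True if any whitespace-delimited token of `post` is a keyword."""
    tokens = post.lower().split()
    return any(tok.strip(string.punctuation) in keywords for tok in tokens)


# Illustrative posts (assumed, not taken from any dataset).
posts = [
    "started taking metformin last week",
    "my doctor put me on metformn",  # misspelling: one letter dropped
    "feeling fine today",
]

exact = {"metformin"}
expanded = exact | edit_distance_one_variants("metformin")

exact_hits = [p for p in posts if matches_any(p, exact)]
expanded_hits = [p for p in posts if matches_any(p, expanded)]

print(len(exact_hits), len(expanded_hits))
```

Exhaustive expansion of this kind produces many variants that never occur in practice, which is one reason a generator driven by the data itself may be preferable for real collection tasks.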