An unsupervised and customizable misspelling generator for mining noisy health-related text sources.

Journal: Journal of Biomedical Informatics
Published Date:

Abstract

BACKGROUND: Data collection and extraction from noisy text sources such as social media typically rely on keyword-based searching/listening. However, health-related terms are often misspelled in such sources because of their complex morphology, so relevant data are excluded from studies. In this paper, we present a customizable, data-centric system that automatically generates common misspellings for complex health-related terms and can thereby improve data collection from noisy text sources.
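
To make the problem concrete, the following minimal sketch shows how an exact keyword search misses misspelled mentions and how an expanded keyword list recovers them. It is an illustration only, not the system described in this paper: the abstract does not state how misspellings are generated, so the sketch swaps in a generic technique (all strings within one character edit of the keyword). The function names `edit1_variants` and `find_mentions` and the example posts are hypothetical.

```python
import re


def edit1_variants(term):
    """Return every string exactly one character edit away from `term`.

    Assumption: a generic edit-distance-1 expansion (delete, transpose,
    replace, insert), used here as a stand-in. It is NOT the paper's method.
    """
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(term[:i], term[i:]) for i in range(len(term) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    return (deletes | transposes | replaces | inserts) - {term}


def find_mentions(posts, keyword, variants):
    """Return the posts that contain the keyword or any variant as a whole token."""
    vocabulary = {keyword} | set(variants)
    hits = []
    for post in posts:
        tokens = set(re.findall(r"[a-z]+", post.lower()))
        if tokens & vocabulary:
            hits.append(post)
    return hits


# Hypothetical posts, invented for illustration.
posts = [
    "Started amoxicillin yesterday",
    "my doc gave me amoxicilin for the infection",
    "amoxcillin makes me nauseous",
    "nothing to report today",
]
keyword = "amoxicillin"

exact = find_mentions(posts, keyword, set())
expanded = find_mentions(posts, keyword, edit1_variants(keyword))
```

Here the exact search returns only the first post, while the expanded search also returns the two misspelled mentions. Exhaustive expansion of this kind produces hundreds of candidates per keyword, most of which never occur in real text, which is one reason a system aimed at common misspellings would need to select candidates based on what actually appears in the data.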

Authors

  • Abeed Sarker
    Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, United States.
  • Graciela Gonzalez-Hernandez
    Health Language Processing Center, Institute for Biomedical Informatics at the Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.