Design of a generic, open platform for machine learning-assisted indexing and clustering of articles in PubMed, a biomedical bibliographic database.

Journal: Data and Information Management
Published Date:

Abstract

Many investigators have carried out text mining of the biomedical literature for a variety of purposes, ranging from the assignment of indexing terms to the disambiguation of author names. A common approach is to define positive and negative training examples, extract features from article metadata, and employ machine learning algorithms. At present, each research group tackles each problem from scratch and in isolation from other projects, which causes redundancy and a great waste of effort. Here, we propose and describe the design of a generic platform for biomedical text mining that can serve both as a shared resource for machine learning projects and as a public repository for their outputs. We initially focus on a specific goal, namely classifying articles according to Publication Type, and emphasize how feature sets can be made more powerful and robust through the use of multiple, heterogeneous similarity measures as input to machine learning models. We then discuss how the generic platform can be extended to include a wide variety of other machine learning-based goals and projects, and how it can also be used as a public platform for disseminating the results of NLP tools to end users.
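
The feature-construction idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Article` fields, the three similarity measures, and the uniform weights are all assumptions chosen for illustration. Each measure compares a candidate article with labelled positive and negative examples, and the resulting per-measure scores form the feature vector. A fixed-weight linear score stands in for a trained machine learning model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    # Hypothetical metadata fields; a real record would carry many more.
    title: str
    journal: str
    terms: frozenset


def jaccard(a, b):
    """Jaccard overlap of two collections, 0.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def title_similarity(x, y):
    # Word overlap between lower-cased titles.
    return jaccard(x.title.lower().split(), y.title.lower().split())


def term_similarity(x, y):
    # Overlap between indexing terms.
    return jaccard(x.terms, y.terms)


def journal_similarity(x, y):
    # 1.0 if both articles appeared in the same journal, else 0.0.
    return 1.0 if x.journal == y.journal else 0.0


# Heterogeneous measures: each looks at a different kind of metadata.
MEASURES = (title_similarity, term_similarity, journal_similarity)


def features(article, positives, negatives):
    """One feature per measure: mean similarity to the positive examples
    minus mean similarity to the negative examples."""
    feats = []
    for measure in MEASURES:
        pos = sum(measure(article, p) for p in positives) / len(positives)
        neg = sum(measure(article, n) for n in negatives) / len(negatives)
        feats.append(pos - neg)
    return feats


def predict(article, positives, negatives, weights=(1.0, 1.0, 1.0)):
    """Placeholder for a learned model: a fixed-weight linear score.
    Returns True when the article looks more like the positive set."""
    feats = features(article, positives, negatives)
    return sum(w * f for w, f in zip(weights, feats)) > 0
```

In practice the feature vectors would be passed to a trained classifier, and the weights would be learned from the labelled examples rather than fixed by hand.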

Authors

  • Neil R Smalheiser
    Department of Psychiatry and Psychiatric Institute, University of Illinois College of Medicine, 1601 West Taylor Street, MC912, Chicago, IL 60612 neils@uic.edu +1-708-312-413-4581.
  • Aaron M Cohen
    Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, Oregon, USA 97239.
