PISTON: Predicting drug indications and side effects using topic modeling and natural language processing.

Journal: Journal of biomedical informatics
Published Date:

Abstract

Discovering novel drugs to treat diseases takes a long time and is costly. It is important to understand the side effects of drugs as well as their therapeutic effects, because unexpected actions of candidate drugs can seriously harm patients. To address these limitations, computational methods for predicting therapeutic effects and side effects have been proposed. In particular, text mining is widely used in systems biology because it can discover hidden relationships between drugs, genes, and diseases from large amounts of literature. Compared with in vivo and in vitro experiments, text mining yields meaningful results at lower time and cost. In this study, we propose an algorithm for predicting novel drug-phenotype associations and drug-side effect associations using topic modeling and natural language processing (NLP). We extract sentences in which drugs and genes co-occur from literature abstracts and use NLP to identify the words that describe the relationship between them. Based on the characteristics of the identified words, we determine whether the drug has an up-regulation or a down-regulation effect on the gene. Using these drug-gene pairs and their regulatory relationships, we group frequently occurring genes and regulatory relationships into topics and build a drug-topic probability matrix, in which each entry is the score, computed by topic modeling, that a drug is associated with a topic. Using the matrix, a classifier is constructed to predict novel indications and side effects of drugs from the characteristics of known drug-phenotype or drug-side effect associations. The proposed method predicts both indications and side effects with a single algorithm, and it can exclude, from the candidate drugs proposed for treating a phenotype, those with serious side effects or side effects that patients do not want to experience. Furthermore, the lists of novel candidate drugs for phenotypes and side effects can be updated continuously each time a document is added. More than a thousand documents are produced per day, and our algorithm can derive candidate drugs efficiently because it costs less than existing drug repositioning methods. The PISTON resource is available at databio.gachon.ac.kr/tools/PISTON.
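
The first steps described in the abstract (finding sentences where a drug and a gene co-occur, then labeling the regulation direction from the words in the sentence) might be sketched roughly as follows. This is a minimal illustration and not the PISTON implementation: the cue-word lists `UP_CUES` and `DOWN_CUES`, the function `regulation_tokens`, the `GENE_up`/`GENE_down` token format, and the example text are all assumptions made here, and simple keyword matching stands in for the NLP analysis the abstract refers to.

```python
import re
from collections import Counter, defaultdict

# Hypothetical cue words; the lexicon actually used by PISTON is not given
# in the abstract.
UP_CUES = {"activates", "induces", "increases", "upregulates", "enhances"}
DOWN_CUES = {"inhibits", "suppresses", "decreases", "downregulates", "reduces"}


def split_sentences(abstract):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def regulation_tokens(abstract, drugs, genes):
    """Return {drug: Counter of 'GENE_up' / 'GENE_down' tokens}.

    A sentence contributes only if it mentions at least one drug, at least
    one gene, and cue words for exactly one direction.
    """
    result = defaultdict(Counter)
    for sentence in split_sentences(abstract):
        words = {w.lower() for w in re.findall(r"[A-Za-z0-9-]+", sentence)}
        found_drugs = [d for d in drugs if d.lower() in words]
        found_genes = [g for g in genes if g.lower() in words]
        if not found_drugs or not found_genes:
            continue  # no drug-gene co-occurrence in this sentence
        is_up = bool(words & UP_CUES)
        is_down = bool(words & DOWN_CUES)
        if is_up == is_down:
            continue  # no cue word, or conflicting cues: skip the sentence
        direction = "up" if is_up else "down"
        for drug in found_drugs:
            for gene in found_genes:
                result[drug][f"{gene}_{direction}"] += 1
    return dict(result)


# Made-up example text, used only to exercise the sketch.
text = (
    "Aspirin inhibits PTGS2 in vitro. "
    "Metformin activates PRKAA1 and increases SIRT1. "
    "Aspirin was well tolerated."
)
tokens = regulation_tokens(
    text, drugs=["aspirin", "metformin"], genes=["PTGS2", "PRKAA1", "SIRT1"]
)
for drug, counts in tokens.items():
    print(drug, dict(counts))
```

Under these assumptions, each drug ends up with a bag of gene-direction tokens. Such a bag could plausibly serve as the per-drug "document" passed to a topic model to obtain a drug-topic probability matrix, though the abstract does not specify which topic model or input representation PISTON uses.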

Authors

  • Giup Jang
    Department of IT Convergence Engineering, Gachon University, Seongnam, Republic of Korea.
  • Taekeon Lee
    Department of Computer Engineering, Gachon University, Seongnam, Republic of Korea.
  • Soyoun Hwang
    Department of IT Convergence Engineering, Gachon University, Seongnam, Republic of Korea.
  • Chihyun Park
    Dept. of Computer Science, Yonsei University, Seodaemun-gu, Seoul, Korea.
  • Jaegyoon Ahn
    Department of Integrative Biology and Physiology, University of California, Los Angeles, USA. Electronic address: jgahn@ucla.edu.
  • Sukyung Seo
    Department of Computer Engineering, Gachon University, Seongnam, Republic of Korea.
  • Youhyeon Hwang
    Department of Computer Science, University of Southern California, Los Angeles, USA.
  • Youngmi Yoon
    Department of Computer Engineering, Gachon University, South Korea. Electronic address: ymyoon@gachon.ac.kr.