Phrase mining of textual data to analyze extracellular matrix protein patterns across cardiovascular disease.

Journal: American journal of physiology. Heart and circulatory physiology
PMID:

Abstract

Extracellular matrix (ECM) proteins have been shown to play important roles in regulating multiple biological processes in an array of organ systems, including the cardiovascular system. Using a novel bioinformatics text-mining tool, we studied six categories of cardiovascular disease (CVD), namely, ischemic heart disease, cardiomyopathies, cerebrovascular accident, congenital heart disease, arrhythmias, and valve disease, anticipating novel ECM protein-disease and protein-protein relationships hidden within vast quantities of textual data. We conducted a phrase-mining analysis, delineating the relationships of 709 ECM proteins with the 6 groups of CVDs reported in 1,099,254 abstracts. The technology pipeline known as Context-Aware Semantic Online Analytical Processing was applied to semantically rank the association of proteins to each CVD and all six CVDs, performing analyses to quantify each protein-disease relationship. We performed principal component analysis and hierarchical clustering of the data, where each protein was visualized as a six-dimensional vector. We found that ECM proteins display variable degrees of association with the six CVDs; certain CVDs share groups of associated proteins, whereas others have divergent protein associations. We identified 82 ECM proteins sharing associations with all 6 CVDs. Our bioinformatics analysis ascribed distinct ECM pathways (via Reactome) to this subset of proteins, namely, insulin-like growth factor regulation and interleukin-4 and interleukin-13 signaling, suggesting their contribution to the pathogenesis of all six CVDs. Finally, we performed hierarchical clustering analysis and identified protein clusters predominantly associated with a targeted CVD; analyses of these proteins revealed unexpected insights underlying the key ECM-related molecular pathogenesis of each CVD, including virus assembly and release in arrhythmias.
NEW & NOTEWORTHY The present study is the first application of a text-mining algorithm to characterize the relationships of 709 extracellular matrix-related proteins with 6 categories of cardiovascular disease described in 1,099,254 abstracts. Our analysis revealed unexpected extracellular matrix functions, pathways, and molecular relationships implicated in the six cardiovascular diseases.
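
The abstract describes each protein as a six-dimensional vector of disease-association scores, with hierarchical clustering used to find protein groups and a subset of proteins associated with all six CVDs. The following is a minimal sketch of that idea, not the authors' pipeline. The protein names, the scores, the disease abbreviations, the Euclidean distance, and the average-linkage rule are all assumptions chosen for illustration; the abstract does not state which metric or linkage was used.

```python
from math import dist

# Hypothetical abbreviations for the six CVD categories (illustration only).
CVDS = ["IHD", "CM", "CVA", "CHD", "ARR", "VD"]

# Hypothetical association scores, NOT taken from the study:
# one six-dimensional vector per protein, ordered as in CVDS.
scores = {
    "PROT_A": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
    "PROT_B": [0.8, 0.9, 0.6, 0.7, 0.4, 0.5],
    "PROT_C": [0.0, 0.1, 0.0, 0.0, 0.9, 0.0],
    "PROT_D": [0.1, 0.0, 0.0, 0.0, 0.8, 0.1],
    "PROT_E": [0.0, 0.0, 0.9, 0.0, 0.0, 0.0],
}


def shared_proteins(scores):
    """Return proteins with a nonzero score for every disease category."""
    return sorted(p for p, v in scores.items() if all(x > 0 for x in v))


def average_linkage(a, b, scores):
    """Mean Euclidean distance over all cross-cluster protein pairs."""
    total = sum(dist(scores[p], scores[q]) for p in a for q in b)
    return total / (len(a) * len(b))


def agglomerate(scores, n_clusters):
    """Bottom-up clustering: merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in sorted(scores)]
    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs,
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]], scores),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


shared = shared_proteins(scores)
groups = agglomerate(scores, n_clusters=3)
```

In this toy data, `shared` holds the proteins scored above zero for every category, and `groups` separates broadly associated proteins from those concentrated on a single category, which mirrors the kind of structure the abstract reports.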

Authors

  • David A Liem
    NIH BD2K Program Centers of Excellence for Big Data Computing-Heart BD2K Center, Departments of Physiology, Medicine/Cardiology, and Bioinformatics, David Geffen School of Medicine, University of California, Los Angeles, California.
  • Sanjana Murali
    NIH BD2K Program Centers of Excellence for Big Data Computing-Heart BD2K Center, Departments of Physiology, Medicine/Cardiology, and Bioinformatics, David Geffen School of Medicine, University of California, Los Angeles, California.
  • Dibakar Sigdel
    NIH BD2K Program Centers of Excellence for Big Data Computing-Heart BD2K Center, Departments of Physiology, Medicine/Cardiology, and Bioinformatics, David Geffen School of Medicine, University of California, Los Angeles, California.
  • Yu Shi
    NIH BD2K Program Centers of Excellence for Big Data Computing-KnowEng Center, Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, Illinois.
  • Xuan Wang
    Baylor Scott & White Health, Dallas, TX, USA.
  • Jiaming Shen
    NIH BD2K Program Centers of Excellence for Big Data Computing-KnowEng Center, Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, Illinois.
  • Howard Choi
    NIH BD2K Program Centers of Excellence for Big Data Computing-Heart BD2K Center, Departments of Physiology, Medicine/Cardiology, and Bioinformatics, David Geffen School of Medicine, University of California, Los Angeles, California.
  • John H Caufield
    NIH BD2K Program Centers of Excellence for Big Data Computing-Heart BD2K Center, Departments of Physiology, Medicine/Cardiology, and Bioinformatics, David Geffen School of Medicine, University of California, Los Angeles, California.
  • Wei Wang
    State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, Macau 999078, China.
  • Peipei Ping
    From the NIH BD2K Center of Excellence for Biomedical Computing at UCLA, Los Angeles, CA (P.P., K.W., A.B.); and NIH BD2K KnowEng Center of Excellence for Biomedical Computing at UIUC, Urbana, IL (J.H.). pping38@g.ucla.edu.
  • Jiawei Han
    Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA; Institute of Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA.