Clustering high throughput biological data with B-MST, a minimum spanning tree based heuristic.

Journal: Computers in Biology and Medicine

Abstract

To address important challenges in bioinformatics, high-throughput data technologies are needed to interpret biological data efficiently and reliably. Clustering is widely used as a first step in interpreting high-dimensional biological data, such as the gene expression data measured by microarrays. A good clustering algorithm should be efficient, reliable, and effective, as demonstrated by its ability to identify biologically relevant clusters. This paper proposes B-MST, a new minimum spanning tree based heuristic guided by an innovative objective function: the tightness and separation index (TSI). The TSI uses co-expression network topology to obtain biologically meaningful clusters, and a local search procedure is developed to minimize the TSI value. The proposed B-MST is evaluated using (1) the adjusted Rand index (ARI) for microarray data sets with known object classes, and (2) Gene Ontology (GO) annotations for data sets without documented object classes.
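For orientation, the two generic ingredients named in the abstract, clustering on a minimum spanning tree and scoring against known classes with the ARI, can be sketched as follows. This is not the B-MST heuristic: the abstract does not define the TSI or the local search, so the sketch assumes the classic variant that cuts the heaviest MST edges under Euclidean distance. The function names (`mst_edges`, `mst_clusters`, `adjusted_rand_index`) and the toy data are hypothetical.

```python
from collections import Counter
from math import comb, dist


def mst_edges(points):
    """Return the MST edges (weight, i, j) of the complete Euclidean graph (Prim)."""
    n = len(points)
    in_tree = [False] * n
    best = [(float("inf"), -1)] * n  # (cheapest weight into the tree, tree endpoint)
    best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i][0])
        in_tree[u] = True
        if best[u][1] >= 0:
            edges.append((best[u][0], best[u][1], u))
        for v in range(n):
            if not in_tree[v]:
                w = dist(points[u], points[v])
                if w < best[v][0]:
                    best[v] = (w, u)
    return edges


def mst_clusters(points, k):
    """Assumed rule (not the TSI): drop the k-1 heaviest MST edges, label components."""
    kept = sorted(mst_edges(points))[: len(points) - k]
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _, i, j in kept:
        parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]


def adjusted_rand_index(truth, pred):
    """ARI computed from the contingency table of two labelings."""
    n = len(truth)
    if n < 2:
        return 1.0
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. both labelings trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy "expression profiles": two well-separated groups of three objects each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
truth = [0, 0, 0, 1, 1, 1]
labels = mst_clusters(points, 2)
print(labels, adjusted_rand_index(truth, labels))
```

On this toy input the single long edge between the two groups is cut, so the recovered labels match the known classes and the ARI is 1.0. A TSI-guided method would presumably replace the "cut the heaviest edges" rule with moves chosen to lower the objective value.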

Authors

  • Harun Pirim
    Department of Systems Engineering, King Fahd University of Petroleum and Minerals, 31261, KSA. Electronic address: harunpirim@gmail.com.
  • Burak Ekşioğlu
    Department of Industrial Engineering, Clemson University, Clemson, SC 29634, USA.
  • Andy D Perkins
    Department of Computer Science and Engineering, Mississippi State University, 39762, USA.