Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests.

Journal: BMC genomics

Published Date: Jan 21, 2015

Abstract

BACKGROUND: Selection and identification of single-nucleotide polymorphisms (SNPs) are the most important tasks in genome-wide association data analysis. The problem is difficult because genome-wide association data are very high dimensional and a large portion of the SNPs in the data are irrelevant to the disease. Advanced machine learning methods have been successfully used in genome-wide association studies (GWAS) to identify genetic variants that have relatively large effects in some common, complex diseases. Among them, the most successful is Random Forests (RF). Despite performing well in terms of prediction accuracy on some moderate-size data sets, RF still struggles in GWAS to select informative SNPs and build accurate prediction models. In this paper, we propose a new two-stage quality-based sampling method for random forests, named ts-RF, for SNP subspace selection in GWAS. The method first applies p-value assessment to find a cut-off point that separates informative and irrelevant SNPs into two groups. The informative SNP group is further divided into two sub-groups: highly informative and weakly informative SNPs. When sampling the SNP subspace for building the trees of the forest, only SNPs from these two sub-groups are taken into account. The feature subspaces always contain highly informative SNPs when used to split a node of a tree.
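
The two-stage sampling idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the cut-off point is found or how many SNPs are drawn from each sub-group, so the fixed thresholds `cutoff` and `strong_cutoff`, the parameter `n_strong`, and the function names `partition_snps` and `sample_subspace` are all hypothetical choices made for illustration only.

```python
import random


def partition_snps(p_values, cutoff=0.05, strong_cutoff=0.001):
    """Split SNP indices into (highly informative, weakly informative).

    Assumption: fixed p-value thresholds stand in for the paper's
    p-value assessment, which the abstract does not specify.
    SNPs with p >= cutoff are treated as irrelevant and dropped.
    """
    strong = [i for i, p in enumerate(p_values) if p < strong_cutoff]
    weak = [i for i, p in enumerate(p_values) if strong_cutoff <= p < cutoff]
    return strong, weak


def sample_subspace(strong, weak, mtry, rng, n_strong=1):
    """Draw up to mtry SNP indices for one node split.

    At least n_strong indices come from the highly informative group
    (hypothetical rule), the rest from the weakly informative group.
    Irrelevant SNPs are never sampled.
    """
    if mtry < 1:
        raise ValueError("mtry must be at least 1")
    if not strong:
        raise ValueError("no highly informative SNPs available")
    # Take more strong SNPs if the weak group cannot fill the subspace.
    k_strong = min(max(n_strong, mtry - len(weak)), len(strong), mtry)
    k_weak = min(mtry - k_strong, len(weak))
    return rng.sample(strong, k_strong) + rng.sample(weak, k_weak)


if __name__ == "__main__":
    # Toy p-values, invented purely for demonstration.
    p_values = [0.0001, 0.2, 0.01, 0.0005, 0.03, 0.9]
    strong, weak = partition_snps(p_values)
    print(sample_subspace(strong, weak, mtry=3, rng=random.Random(0)))
```

In a full random forest, a subspace like this would be redrawn at every node before searching for the best split; the sketch covers only the sampling step.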

Authors

Thanh-Tung Nguyen
Joshua Huang
Qingyao Wu
Thuy Nguyen
Mark Li

Keywords

Algorithms; Alzheimer Disease; Computational Biology; Genetic Predisposition to Disease; Genome-Wide Association Study; Humans; Machine Learning; Models, Genetic; Parkinson Disease; Polymorphism, Single Nucleotide; Reproducibility of Results

External Resources

PubMed ID: 25708662
