A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies.

Journal: BMC Bioinformatics

Abstract

BACKGROUND: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated by different labs. The data are often of variable quality, are procured differently, and contain unwanted noise, which hampers the ability of a predictive model to extract useful information. Data preprocessing methods can be applied to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving the tissue of origin.

Authors

  • Richard Van
    Department of Chemistry and Biochemistry, University of Oklahoma, 101 Stephenson Parkway, Norman, Oklahoma 73019, United States.
  • Daniel Alvarez
  • Travis Mize
    Icahn School of Medicine at Mount Sinai, Institute for Genomic Health, New York City, NY, USA.
  • Sravani Gannavarapu
    Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA.
  • Lohitha Chintham Reddy
    Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA.
  • Fatma Nasoz
    Department of Computer Science, University of Nevada, Las Vegas, NV, USA.
  • Mira V Han
    Nevada Institute of Personalized Medicine, University of Nevada, Las Vegas, 4505 Maryland Parkway, Las Vegas, NV, 89154-4009, USA.