A Support Vector Machine based method to distinguish long non-coding RNAs from protein coding transcripts.

Journal: BMC genomics
Published Date:

Abstract

BACKGROUND: In recent years, thousands of sequencing projects around the world have generated a rapidly increasing number of RNA transcripts, creating enormous volumes of transcript data to be analyzed. An important problem in analyzing these data is distinguishing long non-coding RNAs (lncRNAs) from protein coding transcripts (PCTs). We therefore present a Support Vector Machine (SVM) based method to distinguish lncRNAs from PCTs, using features based on the frequencies of nucleotide patterns and on open reading frame (ORF) lengths in transcripts.
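
To make the two feature families concrete, the following is a minimal sketch of how nucleotide pattern frequencies and an ORF length could be computed for one transcript and combined into a feature vector for an SVM. It is an illustration under stated assumptions, not the authors' implementation: the function names (kmer_frequencies, longest_orf_length, feature_vector), the choice of k = 2, the forward-strand-only ORF scan, and the ATG-to-stop ORF definition are all hypothetical choices, since the abstract does not specify them.

```python
from itertools import product

# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmer_frequencies(seq, k=2):
    """Relative frequency of every k-mer over ACGT, in lexicographic order.

    Assumption: k=2 is an illustrative default, not taken from the paper.
    Windows containing other symbols (e.g. N) are skipped.
    """
    seq = seq.upper().replace("U", "T")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def longest_orf_length(seq):
    """Length in nucleotides of the longest ATG...stop ORF (stop included).

    Assumption: only the three forward-strand reading frames are scanned.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def feature_vector(seq, k=2):
    """k-mer frequencies followed by ORF length and ORF length / transcript length."""
    n = len(seq)
    orf = longest_orf_length(seq)
    return kmer_frequencies(seq, k) + [orf, orf / n if n else 0.0]


if __name__ == "__main__":
    # Toy transcript, invented purely for illustration.
    print(feature_vector("CCATGAAATAGCC"))
```

Vectors of this kind, one per transcript and labeled as lncRNA or PCT, could then be passed to any SVM implementation for training. Which patterns and ORF-derived values the published method actually uses should be taken from the full paper.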

Authors

  • Hugo W Schneider
    Department of Computer Science, University of Brasilia, ICC Central, Instituto de Ciências Exatas, Campus Universitario Darcy Ribeiro, Asa Norte, CEP: 70910-900, Brasilia, Brazil. hugowschneider@gmail.com.
  • Taina Raiol
    Gerência Regional de Brasilia (GEREB), Oswaldo Cruz Foundation (Fiocruz), Av. L3 Norte, Campus Universitário Darcy Ribeiro, Gleba A, Asa Norte, CEP: 70910-900, Brasília, Brazil.
  • Marcelo M Brigido
    Laboratory of Molecular Biology, University of Brasilia, Instituto de Ciencias Biologicas, Campus Universitario Darcy Ribeiro, Asa Norte, CEP: 70910-900, Brasilia, Brazil.
  • Maria Emilia M T Walter
    Departamento de Ciência da Computação, Universidade de Brasília, Brasília-DF 70910-900, Brasil. mariaemilia@unb.br.
  • Peter F Stadler
    Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Hartelstrasse 16-18, Leipzig, D-04107, Germany.