SAMSVM: A tool for misalignment filtration of SAM-format sequences with support vector machine.

Journal: Journal of bioinformatics and computational biology
PMID:

Abstract

Sequence alignment/map (SAM) formatted sequences [Li H, Handsaker B, Wysoker A et al., Bioinformatics 25(16):2078-2079, 2009.] have taken on a central role in bioinformatics since the development of massively parallel sequencing. However, misaligned sequences pose a significant problem in the analysis of sequencing data because they can lead to false positives in variant calling, so the exclusion of misaligned reads is a necessity. In this regard, the multiple features of a SAM-formatted sequence can be treated as a vector in a multidimensional space, which allows the application of a support vector machine (SVM). Using the LIBSVM tools developed by Chang and Lin [Chang C-C, Lin C-J, ACM Trans Intell Syst Technol 2:1-27, 2011.] as a simple interface for support vector classification, the SAMSVM package was developed in this study to enable misalignment filtration of SAM-formatted sequences. Cross-validation between two simulated datasets processed with SAMSVM yielded accuracies ranging from 0.89 to 0.97 and F-scores ranging from 0.77 to 0.94 across 14 groups characterized by mutation rates from 0.001 to 0.1, indicating that the model built using SAMSVM was accurate in misalignment detection. Application of SAMSVM to actual sequencing data resulted in filtration of misaligned reads and correction of variant calling.
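
The idea of treating a SAM record as a vector can be illustrated with a minimal Python sketch. The abstract does not say which features SAMSVM actually uses, so the feature set here (MAPQ, soft-clipped bases, indel bases, the NM edit-distance tag, absolute template length, the properly-paired flag bit, read length) and the function names `sam_features` and `to_libsvm_line` are assumptions chosen for illustration only. The output line follows the sparse `label index:value` text format that LIBSVM reads.

```python
import re

# One CIGAR operation is a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def sam_features(line):
    """Turn one SAM alignment line into a numeric feature vector.

    The chosen features are hypothetical; they are not taken from SAMSVM.
    """
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])    # column 2: bitwise FLAG
    mapq = int(fields[4])    # column 5: mapping quality
    cigar = fields[5]        # column 6: CIGAR string
    tlen = int(fields[8])    # column 9: observed template length
    seq_len = len(fields[9])  # column 10: read sequence

    ops = CIGAR_RE.findall(cigar)
    soft_clipped = sum(int(n) for n, op in ops if op == "S")
    indel_bases = sum(int(n) for n, op in ops if op in "ID")

    # Optional fields have the form TAG:TYPE:VALUE; NM is the edit distance.
    tags = {f[:2]: f[5:] for f in fields[11:]}
    nm = int(tags.get("NM", 0))

    properly_paired = 1 if flag & 0x2 else 0
    return [mapq, soft_clipped, indel_bases, nm, abs(tlen),
            properly_paired, seq_len]


def to_libsvm_line(label, features):
    """Format a labelled vector as 'label 1:v1 2:v2 ...' (1-based indices)."""
    pairs = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    return f"{label} {pairs}"


# A made-up paired-end read used only to exercise the functions.
example = "\t".join([
    "read1", "99", "chr1", "100", "60", "5S45M", "=", "300", "250",
    "A" * 50, "I" * 50, "NM:i:2",
])
vector = sam_features(example)
print(to_libsvm_line(1, vector))
```

Lines written this way, one per read with a label marking correctly aligned versus misaligned reads, could then be passed to an SVM trainer; in practice the features would usually be scaled first so that large-valued columns such as template length do not dominate the kernel.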

Authors

  • Jianfeng Yang
    Department of Surgery, ShangNan Branch of Longhua Hospital, Shanghai University of Traditional Chinese Medicine, Shanghai, China.
  • Xiaofan Ding
    Division of Life Science, Applied Genomics Centre and Centre for Statistical Science, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, P. R. China.
  • Xing Sun
    Division of Life Science, Applied Genomics Centre and Centre for Statistical Science, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, P. R. China.
  • Shui-Ying Tsang
    Division of Life Science, Applied Genomics Centre and Centre for Statistical Science, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, P. R. China.
  • Hong Xue
    International Initiative on Spatial Lifecourse Epidemiology (ISLE), the Netherlands; Department of Health Behavior and Policy, School of Medicine, Virginia Commonwealth University, Richmond, VA, 23298, USA.