Novel natural vector with asymmetric covariance for classifying biological sequences.

Journal: Gene
Published Date:

Abstract

The genome sequences of organisms form a large and complex landscape, which poses a significant challenge in bioinformatics: how to describe and analyze this space effectively with mathematical tools. Comparing relationships between organisms depends on a rational mapping rule that uniformly encodes genome sequences of varying lengths as vectors in a measurable space. Such a mapping enables researchers to apply modern mathematical and machine learning techniques to otherwise challenging genomic comparisons. The natural vector method has been proposed as a concise and effective approach to this problem; however, its various iterations have certain limitations. In response, we analyze the strengths and weaknesses of these natural vector methods and propose an improved version, the asymmetric covariance natural vector (ACNV) method. The new method incorporates k-mer information alongside covariance computations with asymmetric properties between base positions. We tested ACNV on microbial genome sequence datasets, including bacterial, fungal, and viral sequences, and evaluated its performance in terms of classification accuracy and convex hull separation. The results demonstrate that ACNV effectively captures sequence characteristics, showing robust sequence representation capabilities and elegant geometric properties.
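
To make the idea of encoding sequences of varying lengths as fixed-length vectors concrete, the following is a minimal sketch of the classical 12-dimensional natural vector (per-base count, mean position, and normalized second central moment), which is the baseline that ACNV builds on. It is not the ACNV method itself: the abstract does not give the k-mer or asymmetric covariance formulas, so they are not reproduced here. The function name `natural_vector`, the 1-based position convention, and the use of the full sequence length as the normalizer are assumptions made for illustration.

```python
from typing import List


def natural_vector(seq: str, alphabet: str = "ACGT") -> List[float]:
    """Illustrative classical natural vector (not ACNV).

    Returns counts, mean positions, and normalized second central
    moments for each base in `alphabet`, in that order.
    """
    seq = seq.upper()
    n = len(seq)  # assumption: normalize by the full sequence length
    counts: List[float] = []
    means: List[float] = []
    moments: List[float] = []
    for base in alphabet:
        # 1-based positions at which this base occurs (assumed convention)
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # A base that never occurs contributes zeros
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


if __name__ == "__main__":
    # Sequences of different lengths map to vectors of the same dimension
    short_vec = natural_vector("ACGTA")
    long_vec = natural_vector("ACGTACGTTTGACA")
    print(len(short_vec), len(long_vec))
```

Because every sequence maps to a vector of the same dimension, standard distance measures and classifiers can be applied directly, which is the property the abstract relies on.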

Authors

  • Guoqing Hu
    Beijing Institute of Mathematical Sciences and Applications (BIMSA), 101408, Beijing, China. Electronic address: drhu@bimsa.cn.
  • Tao Zhou
    Department of Otorhinolaryngology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China.
  • Piyu Zhou
    Beijing Institute of Mathematical Sciences and Applications (BIMSA), 101408, Beijing, China; State Key Laboratory of Mathematical Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, 100190, Beijing, China; University of Chinese Academy of Sciences, 100049, Beijing, China.
  • Stephen Shing-Toung Yau
    Beijing Institute of Mathematical Sciences and Applications (BIMSA), 101408, Beijing, China; Department of Mathematical Sciences, Tsinghua University, 100084, Beijing, China. Electronic address: yau@uic.edu.