Exploring species taxonomic kingdom using information entropy and nucleotide compositional features of coding sequences based on machine learning methods.

Journal: Methods (San Diego, Calif.)
Published Date:

Abstract

The flow of genetic information from DNA to protein is governed by the central dogma of molecular biology. Genetic drift and mutations usually lead to changes in DNA composition, thereby affecting the coding sequences (CDSs) that encode functional proteins. Analyzing the nucleotide distribution in the coding regions of species is crucial for understanding their evolution. In this study, we applied Markov processes to analyze codon formation in 37,031,061 CDSs across 3,735 species genomes, spanning viruses, archaea, bacteria, and eukaryotes, to explore compositional changes. Our results revealed species-specific preferences for different nucleotides. Information entropies and Markov information densities show that eukaryotes exhibit higher redundancy, followed by viruses, suggesting more gene duplication in eukaryotes and high mutation rates in viruses. Evolutionary trends showed an increase in information entropy and a decrease in Markov entropy, with negative correlations between first- and second-order Markov information densities. Furthermore, uniform manifold approximation and projection (UMAP) was used to reduce information redundancy and reveal unique evolutionary patterns in species classification. The machine learning methods achieved high accuracy in species classification, providing insights into CDS evolution and protein synthesis.
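
The abstract does not give formulas for its entropy measures, so the following is only a minimal sketch of the textbook quantities it appears to build on: the zero-order Shannon entropy of a nucleotide sequence and the conditional entropy of a k-th order Markov model. The function names `shannon_entropy` and `markov_entropy` are hypothetical, and the study's exact definition of "Markov information density" may differ from what is shown here.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Zero-order Shannon entropy of a sequence, in bits per symbol."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def markov_entropy(seq: str, order: int = 1) -> float:
    """Conditional entropy H(X_n | previous `order` symbols), in bits.

    Assumption: a plain k-th order Markov model estimated from observed
    counts; this is not necessarily the estimator used in the study.
    """
    context_counts = Counter()
    joint_counts = Counter()
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        context_counts[context] += 1
        joint_counts[(context, seq[i])] += 1

    total = sum(joint_counts.values())
    h = 0.0
    for (context, _symbol), c in joint_counts.items():
        p_joint = c / total                   # P(context, symbol)
        p_cond = c / context_counts[context]  # P(symbol | context)
        h -= p_joint * math.log2(p_cond)
    return h


if __name__ == "__main__":
    cds = "ATGGCCAAGCTGACCTAA"  # toy sequence, not real data
    print(shannon_entropy(cds))
    print(markov_entropy(cds, order=1))
    print(markov_entropy(cds, order=2))
```

Under these definitions, a sequence with equal nucleotide frequencies reaches the maximum zero-order entropy of 2 bits, while a perfectly periodic sequence has a Markov entropy of 0 because each nucleotide is fully determined by its predecessor. A lower Markov entropy relative to the zero-order entropy therefore indicates stronger dependence between neighboring nucleotides.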

Authors

  • Sebu Aboma Temesgen
    School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China.
  • Basharat Ahmad
    School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China.
  • Bakanina Kissanga Grace-Mercure
    School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, 610054 Chengdu, Sichuan, China.
  • Minghao Liu
    Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Science, Jilin University, 2699 Qianjin Street, Changchun, 130012, China. Electronic address: lmh23@mails.jlu.edu.cn.
  • Li Liu
    Metanotitia Inc., Shenzhen, China.
  • Hao Lin
    Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, Zhejiang, China.
  • Kejun Deng
    College of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China. Electronic address: dengkj@uestc.edu.cn.