Exploring species taxonomic kingdom using information entropy and nucleotide compositional features of coding sequences based on machine learning methods.

Journal: Methods (San Diego, Calif.)

Published Date: Apr 23, 2025

Abstract

The flow of genetic information from DNA to protein is governed by the central dogma of molecular biology. Genetic drift and mutations usually lead to changes in DNA composition, thereby affecting the coding sequences (CDSs) that encode functional proteins. Analyzing the nucleotide distribution in the coding regions of species is crucial for understanding their evolution. In this study, we applied Markov processes to analyze codon formation in 37,031,061 CDSs across 3,735 species genomes, spanning viruses, archaea, bacteria, and eukaryotes, to explore compositional changes. Our results revealed species-specific preferences for different nucleotides. Information entropies and Markov information densities showed that eukaryotes exhibit the highest redundancy, followed by viruses, suggesting more gene duplication in eukaryotes and high mutation rates in viruses. Evolutionary trends showed an increase in information entropy and a decrease in Markov entropy, with negative correlations between first- and second-order Markov information densities. Furthermore, uniform manifold approximation and projection (UMAP) was used to reduce information redundancy and reveal unique evolutionary patterns in species classification. The machine learning methods classified species with high accuracy, providing insights into CDS evolution and protein synthesis.
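To make the entropy quantities mentioned in the abstract concrete, the following is a minimal sketch that computes the Shannon entropy of a sequence's base composition and a first-order Markov (conditional) entropy from adjacent-nucleotide counts. The abstract does not give the authors' exact definitions of "Markov entropy" or "Markov information density", so the formulas below are the standard textbook ones and should be read as an assumption, not as the paper's method. The function names and the example sequence are hypothetical.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq: str) -> float:
    """Zeroth-order entropy of the base composition, in bits per nucleotide."""
    if not seq:
        return 0.0
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def markov_entropy(seq: str) -> float:
    """First-order conditional entropy H(X_t | X_{t-1}), in bits.

    Assumption: probabilities are estimated from raw counts of adjacent
    nucleotide pairs; the paper may use a different estimator.
    """
    if len(seq) < 2:
        return 0.0
    pair_counts = Counter(zip(seq, seq[1:]))   # counts of (previous, current)
    prev_counts = Counter(seq[:-1])            # counts of the conditioning base
    total_pairs = len(seq) - 1
    h = 0.0
    for (prev, _cur), n in pair_counts.items():
        p_pair = n / total_pairs               # joint probability of the pair
        p_cond = n / prev_counts[prev]         # P(current | previous)
        h -= p_pair * log2(p_cond)
    return h


if __name__ == "__main__":
    # Hypothetical toy CDS, for illustration only.
    cds = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
    print("Shannon entropy:", shannon_entropy(cds))
    print("First-order Markov entropy:", markov_entropy(cds))
```

Under these definitions, a sequence with uniform base usage reaches the maximum of 2 bits per nucleotide, while strong dependence between neighbouring bases pushes the conditional entropy below the zeroth-order value. A gap between the two can be read as one rough indication of redundancy.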

Authors

Sebu Aboma Temesgen

School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China.

Basharat Ahmad

School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China.

Bakanina Kissanga Grace-Mercure

School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, 610054 Chengdu, Sichuan, China.

Minghao Liu

Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Science, Jilin University, 2699 Qianjin Street, Changchun, 130012, China. Electronic address: lmh23@mails.jlu.edu.cn.

Li Liu

Metanotitia Inc., Shenzhen, China.

Hao Lin

Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, Zhejiang, China.

Kejun Deng

College of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China. dengkj@uestc.edu.cn.

Keywords

Archaea; Bacteria; Base Composition; Codon; Entropy; Eukaryota; Evolution, Molecular; Machine Learning; Markov Chains; Nucleotides; Open Reading Frames; Viruses

External Resources

PubMed ID: 40280261
