MLDeCNV: A machine learning approach for predicting copy number variation types in plant genomes.

Journal: Computers in biology and medicine

Published Date: Dec 22, 2025

Abstract

Copy number variations (CNVs) play a crucial role in shaping genetic diversity and influencing various plant traits. However, existing methods for CNV characterization often struggle with the complexity and repetitive nature of plant genomes. Here, we present MLDeCNV (Machine Learning for Decoding Copy Number Variation), a novel open-source, machine-learning-based tool optimized for predicting CNV types (deletions, duplications, and non-CNVs) in plant genomes. Built on the XGBoost model, MLDeCNV uses 32 selected CNV-related features derived from coverage metrics, nucleotide composition, and sequencing statistics. The model was trained on a high-confidence CNV dataset comprising experimentally validated and computationally predicted CNVs. It performs strongly across CNV size ranges and training set sizes, achieving an accuracy of 89.27%, with precision, recall, and F1-score all at 89.3%, and an area under the curve (AUC) of 0.9783, which indicates robust and reliable classification. Comparisons with traditional machine learning models show that XGBoost outperforms the other methods, particularly in handling complex, nonlinear interactions within the CNV data. MLDeCNV does not perform de novo CNV detection; instead, it classifies the CNV type of pre-identified genomic regions, making it a post-detection classification tool. The tool, accessible at http://46.202.167.198:5004/, can be integrated downstream of CNV detection pipelines to improve the accuracy of CNV type categorization. Precise classification of CNV types from pre-identified genomic regions will streamline downstream genomic analyses and support better understanding and use of genetic variation in plants.
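To make the feature families named in the abstract more concrete, the following is a minimal Python sketch of how a few coverage and nucleotide-composition features might be computed for one pre-identified region. It is not the authors' feature set: the abstract does not list the 32 features, so the function names (`gc_content`, `region_features`), the feature names, and the input layout (per-base read depths plus the region's reference sequence) are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def gc_content(seq: str) -> float:
    """Fraction of G/C among unambiguous bases (A, C, G, T); N is ignored."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def region_features(depths, seq, genome_mean_depth):
    """Hypothetical feature vector for one candidate CNV region.

    depths            -- per-base read depths across the region
    seq               -- reference sequence of the region
    genome_mean_depth -- genome-wide mean depth used for normalization
    """
    if not depths or genome_mean_depth <= 0:
        raise ValueError("need non-empty depths and a positive genome mean depth")
    region_mean = mean(depths)
    return {
        "mean_depth": region_mean,
        # Ratio to the genome-wide mean: values well below 1 are the usual
        # read-depth signal for a deletion, values well above 1 for a duplication.
        "depth_ratio": region_mean / genome_mean_depth,
        "depth_sd": pstdev(depths),
        "zero_depth_fraction": sum(d == 0 for d in depths) / len(depths),
        "gc_content": gc_content(seq),
    }


# Toy region with made-up numbers, for illustration only.
example = region_features([10, 12, 0, 8, 10], "ACGTGGCCNA", 20.0)
print(example)
```

In a workflow like the one described, feature vectors of this kind would be computed for each region reported by an upstream CNV caller and then passed to a trained three-class classifier (deletion, duplication, non-CNV).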

Authors

Parinita Das

Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India; The Graduate School, ICAR-Indian Agricultural Research Institute, New Delhi, India; Department of Agricultural Biotechnology & Molecular Biology, College of Basic Sciences and Humanities, Dr Rajendra Prasad Central Agricultural University, Pusa, Samastipur, Bihar, India.
Bibek Saha

Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India.
Nitesh Kumar Sharma

Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology (HiCHiCoB, A BIC supported by DBT, India), CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur (HP), India.
Mir Asif Iquebal

Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India.
Alexie Papanicolaou

Hawkesbury Institute for the Environment, Western Sydney University, Richmond, 2753, Australia.
U B Angadi

Indian Council of Agricultural Research-Indian Agricultural Statistics Research Institute, New Delhi, India.
Dinesh Kumar

Department of Mechanical and Industrial Engineering, Indian Institute of Technology Roorkee, Roorkee, India.
Sarika Jaiswal

ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.

Keywords

DNA Copy Number Variations; Genome, Plant; Machine Learning; Software

External Resources

PubMed ID (PMID): 41435499
