MLDeCNV: A machine learning approach for predicting copy number variation types in plant genomes.
Journal:
Computers in biology and medicine
Published Date:
Dec 22, 2025
Abstract
Copy number variations (CNVs) play a crucial role in shaping genetic diversity and influencing various plant traits. However, existing methods for CNV characterization often face challenges due to the complexity and repetitive nature of plant genomes. Here, we present MLDeCNV (Machine Learning for Decoding Copy Number Variation), a novel open-source, machine-learning-based tool optimized for predicting CNV types (deletions, duplications, and non-CNVs) in plant genomes. Built on the XGBoost model, MLDeCNV uses 32 selected CNV-related features derived from coverage metrics, nucleotide composition, and sequencing statistics. The model was trained on a high-confidence CNV dataset comprising experimentally validated and computationally predicted CNVs. It performs strongly across CNV size ranges and training set sizes, achieving an accuracy of 89.27%, with precision, recall, and F1-score all at 89.3%, and an area under the curve (AUC) of 0.9783, underscoring its robustness and reliability. Extensive comparisons with traditional machine learning models show that XGBoost outperforms the other methods, particularly in handling complex, nonlinear interactions within the CNV data. MLDeCNV does not perform de novo CNV detection; it classifies the CNV type of pre-identified genomic regions, making it a post-detection classification tool. The tool, accessible at http://46.202.167.198:5004/, can be integrated downstream of CNV detection pipelines to improve the accuracy of CNV type categorization. Precise classification of CNV types from pre-identified genomic regions will streamline downstream genomic analyses and facilitate the understanding and use of genetic variation in plants.
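To make the feature families mentioned above more concrete, the following is a minimal sketch of how a few coverage and nucleotide-composition features might be computed for one pre-identified region before it is passed to a classifier. The abstract does not list the 32 features MLDeCNV uses, so the function name `region_features`, the feature names, and the input layout are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def region_features(sequence, depths, genome_mean_depth):
    """Compute a few illustrative features for one pre-identified region.

    Assumption: these four features are examples only. They are not the
    published MLDeCNV feature set, which the abstract does not enumerate.
    """
    if not sequence or not depths:
        raise ValueError("sequence and depths must be non-empty")
    if genome_mean_depth <= 0:
        raise ValueError("genome_mean_depth must be positive")

    seq = sequence.upper()
    # Nucleotide composition: fraction of G and C bases in the region.
    gc_content = (seq.count("G") + seq.count("C")) / len(seq)

    # Coverage metrics: mean and spread of per-base read depth.
    mean_depth = mean(depths)
    depth_sd = pstdev(depths)

    # Depth relative to the genome-wide mean. Values well below 1 are
    # suggestive of a deletion and values well above 1 of a duplication.
    depth_ratio = mean_depth / genome_mean_depth

    return {
        "gc_content": gc_content,
        "mean_depth": mean_depth,
        "depth_sd": depth_sd,
        "depth_ratio": depth_ratio,
    }


# Toy region: 8 bases with per-base depth about half the genome-wide mean.
features = region_features("ACGTGGCC", [10, 12, 8, 10, 10, 11, 9, 10], 20.0)
```

In a post-detection workflow of the kind the abstract describes, a feature vector like this would be built for each region reported by an upstream CNV caller and then handed to a trained three-class model (deletion, duplication, non-CNV).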