MLDeCNV: A machine learning approach for predicting copy number variation types in plant genomes.

Journal: Computers in Biology and Medicine
Published Date:

Abstract

Copy number variations (CNVs) play a crucial role in shaping genetic diversity and influencing various plant traits. However, existing methods for CNV characterization often face challenges due to the complexity and repetitive nature of plant genomes. Here, we present MLDeCNV (Machine Learning for Decoding Copy Number Variation), a novel, open-source, machine-learning-based tool optimized for predicting CNV types (deletions, duplications, and non-CNVs) in plant genomes. Built on the XGBoost model, MLDeCNV uses 32 selected CNV-related features derived from coverage metrics, nucleotide composition, and sequencing statistics. The model was trained on a high-confidence CNV dataset comprising experimentally validated and computationally predicted CNVs. It exhibits strong performance across various CNV size ranges and training set sizes, achieving an accuracy of 89.27%, with precision, recall, and F1-score all at 89.3%, and an area under the curve (AUC) of 0.9783, underscoring its robustness and reliability. Extensive comparisons with traditional machine learning models reveal that XGBoost outperforms other methods, particularly in handling complex, nonlinear interactions within the CNV data. MLDeCNV does not perform de novo CNV detection; instead, it classifies the CNV type of pre-identified genomic regions, making it a post-detection classification tool. The tool, accessible at http://46.202.167.198:5004/, can be integrated downstream of CNV detection pipelines to improve the accuracy of CNV type categorization. Precise classification of CNV types from pre-identified genomic regions will streamline downstream genomic analyses, facilitating enhanced understanding and utilization of genetic variation in plants.
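
As an aid to reading the reported figures, the sketch below shows how accuracy, precision, recall, and F1-score can be computed for a three-class problem (deletion, duplication, non-CNV) from a confusion matrix. It is a minimal illustration and not the authors' evaluation code: the function names are hypothetical, the counts are made-up toy numbers rather than results from the paper, and macro-averaging is an assumption, since the abstract does not state which averaging scheme was used.

```python
# Illustrative only: the labels match the abstract's three CNV types, but the
# counts below are made-up toy numbers, not results from the MLDeCNV paper.
LABELS = ["deletion", "duplication", "non-CNV"]


def accuracy(confusion):
    """Fraction of regions whose predicted class equals the true class."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


def per_class_metrics(confusion):
    """Return {label: (precision, recall, f1)} from a square confusion matrix.

    confusion[i][j] is the number of regions with true class i that were
    predicted as class j.
    """
    n = len(confusion)
    metrics = {}
    for k in range(n):
        true_positive = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[LABELS[k]] = (precision, recall, f1)
    return metrics


def macro_average(metrics):
    """Unweighted mean of (precision, recall, f1) over classes (an assumption)."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / n for i in range(3))


# Hypothetical confusion matrix: rows are true classes, columns are predictions.
toy_confusion = [
    [90, 5, 5],   # true deletions
    [6, 88, 6],   # true duplications
    [4, 7, 89],   # true non-CNVs
]

if __name__ == "__main__":
    print("accuracy:", accuracy(toy_confusion))
    for label, (p, r, f1) in per_class_metrics(toy_confusion).items():
        print(label, "precision:", p, "recall:", r, "f1:", f1)
    print("macro average (precision, recall, f1):",
          macro_average(per_class_metrics(toy_confusion)))
```

With roughly balanced classes, as in the toy matrix, accuracy and the averaged precision, recall, and F1-score come out nearly identical, which is the same pattern as the closely matching values quoted in the abstract.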

Authors

  • Parinita Das
    Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India; The Graduate School, ICAR-Indian Agricultural Research Institute, New Delhi, India; Department of Agricultural Biotechnology & Molecular Biology, College of Basic Sciences and Humanities, Dr Rajendra Prasad Central Agricultural University, Pusa, Samastipur, Bihar, India.
  • Bibek Saha
    Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India.
  • Nitesh Kumar Sharma
    Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology (HiCHiCoB, A BIC supported by DBT, India), CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur (HP), India.
  • Mir Asif Iquebal
    Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India.
  • Alexie Papanicolaou
    Hawkesbury Institute for the Environment, Western Sydney University, Richmond, 2753, Australia.
  • U B Angadi
    Indian Council of Agricultural Research-Indian Agricultural Statistics Research Institute, New Delhi, India.
  • Dinesh Kumar
    Department of Mechanical and Industrial Engineering, Indian Institute of Technology Roorkee, Roorkee, India.
  • Sarika Jaiswal
    ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.