A Machine-Learning-Based Approach to Prediction of Biogeographic Ancestry within Europe.

Journal: International journal of molecular sciences
Published Date:

Abstract

Data obtained with massively parallel sequencing (MPS) can be valuable in population genetics studies. In particular, such data have the potential to distinguish samples from different populations, including adjacent populations of common origin. Machine learning (ML) techniques seem especially well suited to analyzing the large datasets obtained using MPS. The Slavic populations constitute about a third of the population of Europe and inhabit a large area of the continent, while being relatively closely related in population genetics terms. In this proof-of-concept study, various ML techniques were used to classify DNA samples from Slavic and non-Slavic individuals. The primary objective was to empirically evaluate whether the genetic provenance of genetically similar individuals of Slavic descent can be discerned, with the overarching goal of categorizing DNA samples from representatives of diverse Slavic populations. Raw sequencing data were pre-processed to obtain a 1200-character-long binary vector. Three classifiers were used: Random Forest, Support Vector Machine (SVM), and XGBoost. The most promising results were obtained using SVM with a linear kernel, with 99.9% accuracy and F1-scores of 0.9846 to 1.000 for all classes.
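The abstract reports overall accuracy together with a separate F1-score for each class. As a minimal sketch of how these two metrics are conventionally computed from a classifier's predictions, the Python example below uses toy labels. The function name `per_class_f1` and the class labels "A", "B", and "C" are hypothetical and chosen only for illustration; they are not the study's populations, data, or evaluation code.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return (accuracy, {label: F1}) for multi-class predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    f1 = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1[label] = 2 * tp[label] / denom if denom else 0.0

    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, f1


# Toy labels (hypothetical; not taken from the study)
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]
accuracy, f1 = per_class_f1(y_true, y_pred)
```

Reporting F1 per class, as the abstract does, shows whether any single population is classified markedly worse than the others, which a single accuracy figure can hide when class sizes differ.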

Authors

  • Anna Kloska
    Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland.
  • Agata Giełczyk
    Bydgoszcz University of Science and Technology, Bydgoszcz, Poland.
  • Tomasz Grzybowski
    Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland.
  • Rafał Płoski
    Department of Medical Genetics, Warsaw Medical University, 02106 Warsaw, Poland.
  • Sylwester M Kloska
    Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland.
  • Tomasz Marciniak
    Faculty of Telecommunications, Computer Science and Electrical Engineering, Bydgoszcz University of Science and Technology, 85796 Bydgoszcz, Poland.
  • Krzysztof Pałczyński
    Faculty of Telecommunications, Computer Science and Electrical Engineering, Bydgoszcz University of Science and Technology, 85-796 Bydgoszcz, Poland.
  • Urszula Rogalla-Ładniak
    Department of Forensic Medicine, The Ludwik Rydygier Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, 85067 Bydgoszcz, Poland.
  • Boris A Malyarchuk
    Institute of Biological Problems of the North, Russian Academy of Sciences, 685000 Magadan, Russia.
  • Miroslava V Derenko
    Institute of Biological Problems of the North, Russian Academy of Sciences, 685000 Magadan, Russia.
  • Nataša Kovačević-Grujičić
    Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, 11042 Belgrade, Serbia.
  • Milena Stevanović
    Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, 11042 Belgrade, Serbia.
  • Danijela Drakulić
    Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, 11042 Belgrade, Serbia.
  • Slobodan Davidović
    Institute for Biological Research "Siniša Stanković", National Institute of Republic of Serbia, University of Belgrade, 11060 Belgrade, Serbia.
  • Magdalena Spólnicka
    Center of Forensic Sciences, University of Warsaw, 00927 Warsaw, Poland.
  • Magdalena Zubańska
    Faculty of Law and Administration, Department of Criminology and Forensic Sciences, University of Warmia and Mazury, 10726 Olsztyn, Poland.
  • Marcin Wozniak
    Faculty of Applied Mathematics, Silesian University of Technology, 44-100 Gliwice, Poland.