Optimization scheme of machine learning model for genetic division between northern Han, southern Han, Korean and Japanese.

Journal: Yi chuan = Hereditas
PMID:

Abstract

Han Chinese, Koreans and Japanese are the main populations of East Asia, and Han Chinese show a gradient of admixture from north to south. The East Asian populations differ in genetic structure. To achieve fine-scale genetic classification of southern (S-) and northern (N-) Han Chinese, Korean and Japanese individuals, we collected and analyzed 1185 ancestry informative SNPs (AISNPs) from previous literature reports and our laboratory findings. First, two machine learning algorithms, softmax and randomForest, were used to build genetic classification models. Then, phylogenetic tree, STRUCTURE and principal component analysis were used to evaluate the classification performance of different AISNP panels. The 234-AISNP panel achieved fine-scale differentiation among the target populations in four classification schemes. The softmax model reached an accuracy of 92% in classifying S-Han, N-Han, Korean and Japanese individuals. The two machine learning models tested in this study provide important references for the high-resolution discrimination of closely related populations and will be useful tools for optimizing marker panels when developing forensic DNA ancestry inference systems.
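
To illustrate the kind of softmax classifier the abstract refers to, the following is a minimal sketch and not the authors' implementation. It assumes genotypes are coded as 0/1/2 allele counts per SNP, generates purely synthetic data with made-up allele frequencies, and fits a multinomial (softmax) regression by stochastic gradient descent. All function names, the panel size of 30 SNPs, the sample sizes and the training settings are hypothetical choices for illustration; the randomForest model is not covered.

```python
import math
import random

POPULATIONS = ["S-Han", "N-Han", "Korean", "Japanese"]


def simulate_genotypes(rng, n_per_pop=40, n_snps=30):
    """Return (X, y): genotypes as 0/1/2 allele counts plus population labels.

    Allele frequencies are random placeholders, not values from the study.
    """
    X, y = [], []
    for label in range(len(POPULATIONS)):
        freqs = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
        for _ in range(n_per_pop):
            # Two independent allele draws per SNP give a count of 0, 1 or 2.
            X.append([(rng.random() < p) + (rng.random() < p) for p in freqs])
            y.append(label)
    return X, y


def softmax(scores):
    """Convert raw class scores to probabilities (max-shifted for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(W, b, x):
    """Class probabilities for one genotype vector x."""
    return softmax([sum(w * v for w, v in zip(row, x)) + bias
                    for row, bias in zip(W, b)])


def train_softmax(X, y, n_classes, epochs=100, lr=0.01):
    """Fit weights W and biases b by SGD on the cross-entropy loss."""
    W = [[0.0] * len(X[0]) for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, label in zip(X, y):
            probs = predict_proba(W, b, x)
            for k in range(n_classes):
                # Gradient of cross-entropy w.r.t. the class-k score.
                err = probs[k] - (1.0 if k == label else 0.0)
                b[k] -= lr * err
                for j, v in enumerate(x):
                    W[k][j] -= lr * err * v
    return W, b


def predict(W, b, x):
    """Index of the most probable population for genotype vector x."""
    probs = predict_proba(W, b, x)
    return max(range(len(probs)), key=probs.__getitem__)


def run_demo(seed=0):
    """Train on 75% of the synthetic individuals, return held-out accuracy."""
    rng = random.Random(seed)
    X, y = simulate_genotypes(rng)
    order = list(range(len(X)))
    rng.shuffle(order)
    cut = int(0.75 * len(order))
    train, test = order[:cut], order[cut:]
    W, b = train_softmax([X[i] for i in train], [y[i] for i in train],
                         len(POPULATIONS))
    hits = sum(predict(W, b, X[i]) == y[i] for i in test)
    return hits / len(test)


if __name__ == "__main__":
    print(f"held-out accuracy on synthetic data: {run_demo():.2f}")
```

The synthetic populations here are far more distinct than real East Asian populations, so the accuracy this sketch prints says nothing about the 92% reported above. In practice a library implementation with regularization and cross-validation would likely be preferable to hand-written gradient descent.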

Authors

  • Yong-Qiang Kong
    Key Laboratory of Tianjin for Epigenetics, Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China.
  • Jin-Kai Liu
    Key Laboratory of Tianjin for Epigenetics, Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China.
  • Jia-Qi Gu
    Key Laboratory of Phylogeny and Comparative Genomics of Jiangsu Province, Jiangsu Normal University, Xuzhou 221116, China.
  • Jing-Yi Xu
    Key Laboratory of Tianjin for Epigenetics, Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China.
  • Yu-Nuo Zheng
    Key Laboratory of Phylogeny and Comparative Genomics of Jiangsu Province, Jiangsu Normal University, Xuzhou 221116, China.
  • Yi-Liang Wei
    Key Laboratory of Phylogeny and Comparative Genomics of Jiangsu Province, Jiangsu Normal University, Xuzhou 221116, China.
  • Shao-Yuan Wu
    Key Laboratory of Tianjin for Epigenetics, Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China.