An Explainable Deep Learning Classifier of Bovine Mastitis Based on Whole-Genome Sequence Data - Circumventing the p >> n Problem.

Journal: International journal of molecular sciences
PMID:

Abstract

A serious obstacle to the biological annotation of whole-genome sequence data is the p >> n problem: the number of polymorphic variants (p) is much larger than the number of available phenotypic records (n). We propose a way to circumvent this problem by combining LASSO logistic regression with deep learning to classify cows as susceptible or resistant to mastitis based on single nucleotide polymorphism (SNP) genotypes. Among several architectures, the one based on 204,642 SNPs was selected as the best. It consisted of two layers with 7 and 46 units and drop-out rates of 0.210 and 0.358, respectively. Classification of the test data yielded AUC = 0.750, accuracy = 0.650, sensitivity = 0.600, and specificity = 0.700. Significant SNPs were selected based on SHapley Additive exPlanations (SHAP). One GO term related to biological process and thirteen GO terms related to molecular function were significantly enriched in the gene set corresponding to the significant SNPs. Our findings show that the best-performing approach correctly predicts susceptibility or resistance status for approximately 65% of cows. Genes marked by the most significant SNPs are related to the immune response and protein synthesis.
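
The reported accuracy, sensitivity, and specificity can be related to one another with a small worked example. The sketch below is not the authors' evaluation code. It assumes that susceptible cows are the positive class (label 1) and uses a hypothetical, balanced test set of 20 cows whose counts were chosen only to reproduce the reported rates; the abstract does not state the actual test-set size. The function names and the toy scores passed to `auc` are likewise illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical test set (assumption): 10 susceptible (1) and 10 resistant (0) cows.
y_true = [1] * 10 + [0] * 10
# 6 of 10 susceptible cows detected, 7 of 10 resistant cows correctly cleared.
y_pred = [1] * 6 + [0] * 4 + [0] * 7 + [1] * 3

metrics = classification_metrics(y_true, y_pred)
print(metrics)

# Toy scores (assumption) only to show how auc() is called.
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(toy_auc)
```

With equal numbers of susceptible and resistant cows, accuracy is the mean of sensitivity and specificity, (0.600 + 0.700) / 2 = 0.650, which agrees with the reported figures. If the real test set were unbalanced, accuracy would instead be the class-weighted average of the two rates.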

Authors

  • Krzysztof Kotlarz
    Biostatistics Group, Department of Genetics, Wroclaw University of Environmental and Life Sciences, Kozuchowska 7, 51-631, Wroclaw, Poland.
  • Magda Mielczarek
    Biostatistics Group, Department of Genetics, Wroclaw University of Environmental and Life Sciences, Kozuchowska 7, 51-631, Wroclaw, Poland.
  • Przemysław Biecek
    Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland.
  • Katarzyna Wojdak-Maksymiec
    Department of Genetics and Animal Breeding, West Pomeranian University of Technology, Aleja Piastow 45, 70-311 Szczecin, Poland.
  • Tomasz Suchocki
    Biostatistics Group, Department of Genetics, Wroclaw University of Environmental and Life Sciences, Kozuchowska 7, 51-631, Wroclaw, Poland.
  • Piotr Topolski
    National Research Institute of Animal Production, Krakowska 1, 32-083 Balice, Poland.
  • Wojciech Jagusiak
    National Research Institute of Animal Production, Krakowska 1, 32-083 Balice, Poland.
  • Joanna Szyda
    Biostatistics Group, Department of Genetics, Wroclaw University of Environmental and Life Sciences, Kozuchowska 7, 51-631, Wroclaw, Poland. joanna.szyda@upwr.edu.pl.