An interpretable machine learning framework for dog breed inference and ancestry decomposition
Journal: bioRxiv
Published Date: Jun 4, 2026
Abstract
The more than 300 currently recognized breeds of domesticated dogs are the culmination of centuries of intense artificial selection and recurrent population bottlenecks. Although breed labels are widely used in genetic and veterinary studies, inferring breed identity from genomic data remains challenging because of the high dimensionality of genotype data, uneven sampling across breeds, and admixture that produces mixed-breed individuals. Here, we present an interpretable machine learning framework to infer dog breed labels from genome-wide SNP data. Our approach combines dimensionality reduction with a multi-output random forest model that maps genetic variation to a continuous representation of breed membership, enabling both classification and mixed-breed inference. We apply this framework to the Dog Aging Project (DAP) dataset of 6,572 purebred and mixed-breed dogs across 100 breed classes, where it achieves 91.7% accuracy under an overlap-based metric, compared with 87.8% for an ADMIXTURE-based benchmark. Notably, we find that as few as 150 informative SNPs are sufficient to achieve near-maximal predictive performance, highlighting the highly structured nature of canine genetic variation. We also introduce a SNP importance score that links model predictions back to individual genetic variants. Analysis of top-ranked variants reveals loci previously associated with morphological, pigmentation, and behavioral traits, as well as candidate loci lacking prior phenotypic annotation, supporting both the biological relevance and the discovery potential of the framework. Together, these results demonstrate that our framework provides an accurate, flexible, and interpretable approach to predicting breed ancestry, with applications in veterinary genomics, canine population genetics, and the identification of loci underlying hallmark breed phenotypes.
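The abstract does not define its overlap-based accuracy metric. As a minimal sketch of one plausible reading, the Python below scores a predicted breed composition against a reference composition by summing, for each breed, the smaller of the two fractions. This definition, the function names `overlap_score` and `mean_overlap`, and the example breed fractions are all illustrative assumptions, not the authors' implementation or data.

```python
from typing import Dict, Iterable, Tuple

BreedFractions = Dict[str, float]


def overlap_score(true_frac: BreedFractions, pred_frac: BreedFractions) -> float:
    """Assumed overlap metric: sum over breeds of min(true, predicted) fraction.

    Each composition is first rescaled to sum to 1, so the score lies in [0, 1]:
    1 for identical compositions, 0 when no breed is shared.
    """
    def normalize(fracs: BreedFractions) -> BreedFractions:
        total = sum(fracs.values())
        if total <= 0:
            raise ValueError("breed fractions must have a positive sum")
        return {breed: value / total for breed, value in fracs.items()}

    t = normalize(true_frac)
    p = normalize(pred_frac)
    # Breeds missing from one composition count as fraction 0 there.
    return sum(min(t.get(b, 0.0), p.get(b, 0.0)) for b in set(t) | set(p))


def mean_overlap(pairs: Iterable[Tuple[BreedFractions, BreedFractions]]) -> float:
    """Average the assumed overlap score over (true, predicted) pairs."""
    scores = [overlap_score(t, p) for t, p in pairs]
    if not scores:
        raise ValueError("at least one (true, predicted) pair is required")
    return sum(scores) / len(scores)


# Hypothetical mixed-breed dog: half Labrador, half Poodle in the reference labels.
truth = {"labrador_retriever": 0.5, "poodle": 0.5}
pred = {"labrador_retriever": 0.6, "poodle": 0.3, "beagle": 0.1}
score = overlap_score(truth, pred)  # approximately 0.8 (0.5 + 0.3 + 0.0)
```

Under this reading, a purebred dog reduces to ordinary classification credit (the score is the predicted fraction assigned to the correct breed), while a mixed-breed dog earns partial credit for each correctly apportioned breed.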