sstar2: A Python Package for S*-based Archaic Introgression Detection with Machine Learning
Journal: bioRxiv
Published Date: Jun 3, 2026
Abstract
Detecting introgressed genomic fragments from unsampled or extinct source populations remains challenging. The S* statistic is widely used for this purpose, but the original sstar implementation relies on generalized additive models to smooth quantile-specific values precomputed from fixed count bins, which requires simulations with fixed numbers of segregating sites. Here, we present sstar2, an updated Python package that replaces this procedure with quantile regression to directly estimate S* thresholds at specified null quantiles from simulated genomic windows. We benchmarked sstar2 against the original sstar, linear quantile regression, and random forest quantile regression across three demographic models with both phased and unphased simulated data. sstar2 showed the best overall performance among the evaluated methods, with the most pronounced improvement under a challenging demographic model of ghost introgression in bonobos. These results show that sstar2 improves S* threshold calibration while making S*-based introgression analyses more flexible and compatible with modern simulation workflows.
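To illustrate the general idea of estimating an S* threshold at a null quantile directly from simulated windows, the following is a minimal sketch rather than the sstar2 implementation. It assumes plain linear quantile regression (one of the baselines named above, not necessarily the model sstar2 uses), a single covariate (the number of segregating sites per window), and made-up toy data. The names `fit_linear_quantile`, `s_star_threshold`, and `is_candidate` are hypothetical and are not part of the sstar2 API.

```python
import random


def pinball_loss(residual, q):
    """Pinball (quantile) loss for one residual, where residual = y - prediction."""
    return q * residual if residual >= 0 else (q - 1.0) * residual


def fit_linear_quantile(xs, ys, q):
    """Fit y ~ intercept + slope * x at quantile q by exhaustive search.

    For a line with an intercept, some minimizer of the total pinball loss
    passes through at least two data points, so every pair of points with
    distinct x is tried and the best line is kept. This is O(n^3) and only
    meant for small illustrative inputs.
    """
    best = None
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            if xs[i] == xs[j]:
                continue  # vertical pair, no finite slope
            slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
            intercept = ys[i] - slope * xs[i]
            loss = sum(
                pinball_loss(y - (intercept + slope * x), q)
                for x, y in zip(xs, ys)
            )
            if best is None or loss < best[0]:
                best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two windows with distinct x values")
    return best[1], best[2]


# Toy "null" windows: the numbers below are invented for illustration only
# and do not come from the paper or from any demographic simulation.
rng = random.Random(0)
null_snps = [rng.randint(20, 200) for _ in range(80)]
null_scores = [500.0 * k + rng.gauss(0.0, 40.0 * k ** 0.5) for k in null_snps]

Q = 0.95  # assumed null quantile used as the significance threshold
intercept, slope = fit_linear_quantile(null_snps, null_scores, Q)


def s_star_threshold(n_snps):
    """Estimated S* threshold at quantile Q for a window with n_snps sites."""
    return intercept + slope * n_snps


def is_candidate(observed_s_star, n_snps):
    """Flag a window whose observed S* exceeds the fitted null threshold."""
    return observed_s_star > s_star_threshold(n_snps)
```

Because the threshold is a function of the covariate rather than a lookup over fixed bins, null windows with any number of segregating sites can contribute to the fit. A real analysis would replace the toy data with windows simulated under a no-introgression demographic model and would likely use a dedicated quantile regression library instead of this brute-force search.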