Phylogenetically Dispersed Subsetting for Species-Level Machine Learning Evaluation: Dependence-Aware Validation and Limited Effective Information

Journal: bioRxiv
Published Date:

Abstract

Machine learning is increasingly applied to species-level biological data, but phylogenetic autocorrelation can make the species used for evaluation statistically non-independent. This violates the independence assumption of standard model evaluation and can lead to overconfident performance claims through phylogenetic interpolation. We present a dependence-aware framework, implemented in the R package PhyloSubset, for constructing phylogenetically dispersed species subsets from a user-defined candidate pool. The framework treats subset construction as an optimization problem over distance-based criteria that capture closest-pair separation, overall phylogenetic spread, and nearest-neighbor spacing. By changing the optimization objective, the same framework can also construct phylogenetically clustered subsets as high-dependence contrast cases. Selected subsets are evaluated against empirical null distributions generated by repeated random sampling and are further assessed using diagnostics derived from the within-subset correlation structure, including mean-based effective sample size (MeanESS). Using Carnivora and Cricetidae as empirical case studies, we show that dispersed and clustered subsets occupy opposite tails of random-subset distributions for both distance-based metrics and covariance-based dependence diagnostics. However, phylogenetically dispersed subsetting reduced but did not eliminate internal dependence: in the Carnivora example, a nominal 20-species dispersed subset had a MeanESS of only 4.66 under a Brownian-motion covariance structure, and MeanESS exceeded half of the nominal subset size only when the assumed phylogenetic covariance was substantially weakened. These results show that phylogenetically dispersed subsetting can provide stricter and more reproducible evaluation subsets, while also revealing how little effective information may remain in species-level benchmarks. More broadly, PhyloSubset provides a practical foundation for dependence-aware validation strategies in species-level machine learning.
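
The workflow summarized in the abstract can be illustrated with a minimal Python sketch. It is not the PhyloSubset API (the package is written in R), and every function name below is hypothetical. The sketch assumes a precomputed patristic distance matrix, uses a greedy farthest-point heuristic in place of whatever optimizer the package actually applies, and takes MeanESS to be n^2 divided by the sum of all entries of the within-subset correlation matrix, which is one common definition of effective sample size for a sample mean and may differ from the definition used in the paper.

```python
import random
from itertools import combinations


def subset_metrics(dist, subset):
    """Return (closest-pair, mean pairwise, mean nearest-neighbour) distance."""
    pairs = [dist[i][j] for i, j in combinations(subset, 2)]
    nearest = [min(dist[i][j] for j in subset if j != i) for i in subset]
    return min(pairs), sum(pairs) / len(pairs), sum(nearest) / len(nearest)


def greedy_dispersed(dist, k):
    """Greedy farthest-point heuristic (an assumption, not the package's optimizer).

    Start from the most distant pair, then repeatedly add the candidate
    whose closest already-selected species is farthest away.
    """
    n = len(dist)
    first, second = max(combinations(range(n), 2), key=lambda p: dist[p[0]][p[1]])
    chosen = [first, second]
    while len(chosen) < k:
        rest = [c for c in range(n) if c not in chosen]
        chosen.append(max(rest, key=lambda c: min(dist[c][s] for s in chosen)))
    return sorted(chosen)


def null_percentile(dist, subset, n_draws=999, seed=0):
    """Share of random same-size subsets whose closest-pair distance is <= observed."""
    rng = random.Random(seed)
    observed = subset_metrics(dist, subset)[0]
    draws = [
        subset_metrics(dist, rng.sample(range(len(dist)), len(subset)))[0]
        for _ in range(n_draws)
    ]
    return sum(d <= observed for d in draws) / n_draws


def mean_ess(corr):
    """Assumed definition: n^2 / (sum of all entries of the correlation matrix)."""
    n = len(corr)
    return n * n / sum(sum(row) for row in corr)


# Toy candidate pool: two clades of three species each (made-up distances).
clade = [0, 0, 0, 1, 1, 1]
dist = [
    [0 if i == j else (2 if clade[i] == clade[j] else 10) for j in range(6)]
    for i in range(6)
]

picked = greedy_dispersed(dist, 2)
print(picked, subset_metrics(dist, picked), null_percentile(dist, picked))

# Four species with pairwise correlation 0.5 carry far fewer than four
# species' worth of independent information under the assumed definition.
equicorrelated = [[1.0 if i == j else 0.5 for j in range(4)] for i in range(4)]
print(mean_ess(equicorrelated))
```

Under this assumed definition, uncorrelated species give a MeanESS equal to the nominal subset size, and positive correlation shrinks it, which is the qualitative pattern the abstract reports for the 20-species Carnivora subset. Swapping `max` for `min` in the greedy step would give a crude clustered contrast case.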

Authors

  • Huang, R.
  • Qi, B.
  • Niu, D.-K.