Estimating prevalence with precision and accuracy
Journal: arXiv
Published Date: Jul 8, 2025
Abstract
Unlike classification, whose goal is to estimate the class of each data point
in a dataset, prevalence estimation, or quantification, is a task that aims to
estimate the distribution of classes in a dataset. The two main tasks in
prevalence estimation are to adjust for the bias induced by the class prevalence
in the training dataset and to quantify the uncertainty in the estimate. The standard
methods used to quantify uncertainty in prevalence estimates are bootstrapping
and Bayesian quantification methods. It is not clear which approach is preferable in
terms of precision (i.e., the width of confidence intervals) and coverage (i.e.,
whether the confidence intervals are well calibrated). Here, we propose Precise
Quantifier (PQ), a Bayesian quantifier that is more precise than existing
quantifiers and has well-calibrated coverage. We discuss the theory behind PQ
and present experiments based on simulated and real-world datasets. Through
these experiments, we establish the factors which influence quantification
precision: the discriminatory power of the underlying classifier; the size of
the labeled dataset used to train the quantifier; and the size of the unlabeled
dataset for which prevalence is estimated. Our analysis provides deep insights
into uncertainty quantification for quantification learning.
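
To make the two tasks named in the abstract concrete, the following is a minimal sketch of a standard baseline, not of PQ itself: adjusted classify-and-count for bias correction, with a percentile bootstrap for the interval. It assumes a binary problem with hard 0/1 classifier predictions; the function names (adjusted_count, rates, bootstrap_interval) and all parameters are hypothetical and chosen only for illustration. Resampling both the labeled and the unlabeled data reflects two of the factors listed above, namely the sizes of the two datasets.

```python
import random

def adjusted_count(preds, tpr, fpr):
    """Adjusted classify-and-count for a binary task.

    Inverts E[predicted positive rate] = p * tpr + (1 - p) * fpr for p,
    then clips the result to [0, 1].
    """
    if tpr == fpr:
        raise ValueError("classifier has no discriminatory power (tpr == fpr)")
    raw = sum(preds) / len(preds)
    p = (raw - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))

def rates(labels, preds):
    """Estimate (tpr, fpr) from labeled data with hard 0/1 predictions."""
    pos = [p for y, p in zip(labels, preds) if y == 1]
    neg = [p for y, p in zip(labels, preds) if y == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)

def bootstrap_interval(val_labels, val_preds, test_preds,
                       n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the adjusted prevalence estimate.

    Both the labeled set (used for tpr/fpr) and the unlabeled set are
    resampled, so the interval reflects uncertainty from both sources.
    """
    rng = random.Random(seed)
    val = list(zip(val_labels, val_preds))
    estimates = []
    for _ in range(n_boot):
        val_s = rng.choices(val, k=len(val))
        test_s = rng.choices(test_preds, k=len(test_preds))
        ys, ps = zip(*val_s)
        # Skip degenerate resamples that lack one class or have tpr == fpr.
        if 0 < sum(ys) < len(ys):
            tpr, fpr = rates(ys, ps)
            if tpr != fpr:
                estimates.append(adjusted_count(test_s, tpr, fpr))
    if not estimates:
        raise ValueError("no usable bootstrap resamples")
    estimates.sort()
    lo = estimates[int((alpha / 2) * len(estimates))]
    hi = estimates[min(len(estimates) - 1,
                       int((1 - alpha / 2) * len(estimates)))]
    return lo, hi
```

In this sketch the interval widens as tpr and fpr move closer together, which is one way to see why the discriminatory power of the underlying classifier affects precision.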