The Importance of Being Discrete: Measuring the Impact of Discretization in End-to-End Differentially Private Synthetic Data
Journal: arXiv
Published Date: Apr 9, 2025
Abstract
Differentially Private (DP) generative marginal models are often used in the
wild to release synthetic tabular datasets in lieu of sensitive data while
providing formal privacy guarantees. These models approximate low-dimensional
marginals or query workloads; crucially, they require the training data to be
pre-discretized, i.e., continuous values need to first be partitioned into
bins. However, as the range of values (or their domain) is often inferred
directly from the training data, with the number of bins and bin edges
typically defined arbitrarily, this approach can ultimately break end-to-end DP
guarantees and may not always yield optimal utility.
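
To make the domain-inference issue concrete, the following is a minimal sketch of uniform (equal-width) discretization in Python. It is illustrative only and is not the paper's implementation: the function names, the example column, and the bounds [0, 100] are hypothetical, and fixing the bounds in advance is just one simple way to avoid reading the domain from the data, not necessarily how the DP discretizers studied here work.

```python
from bisect import bisect_right

def uniform_bin_edges(lower, upper, n_bins):
    """Return n_bins + 1 equal-width bin edges covering [lower, upper]."""
    width = (upper - lower) / n_bins
    return [lower + i * width for i in range(n_bins + 1)]

def discretize(values, edges):
    """Map each value to a bin index in 0..n_bins-1, clipping to the domain."""
    lower, upper = edges[0], edges[-1]
    interior = edges[1:-1]  # only interior edges separate the bins
    return [bisect_right(interior, min(max(v, lower), upper)) for v in values]

# Hypothetical sensitive column (e.g., ages).
data = [18.0, 25.0, 31.0, 47.0, 90.0]

# Data-dependent domain: the first and last edges are the exact minimum
# and maximum of the sensitive column, so the edges themselves reveal them.
leaky_edges = uniform_bin_edges(min(data), max(data), 4)

# Data-independent domain: bounds assumed to be known in advance
# (a hypothetical choice), so the edges do not depend on any record.
safe_edges = uniform_bin_edges(0.0, 100.0, 4)
bins = discretize(data, safe_edges)
```

With the data-dependent edges, adding or removing the record holding the minimum or maximum changes every bin edge, which is the kind of dependence that a DP guarantee on the generative model alone does not account for.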
In this paper, we present an extensive measurement study of four
discretization strategies in the context of DP marginal generative models. More
precisely, we design DP versions of three discretizers (uniform, quantile, and
k-means) and reimplement the PrivTree algorithm. We find that optimizing both
the choice of discretizer and bin count can improve utility, on average, by
almost 30% across six DP marginal models, compared to the default strategy and
number of bins, with PrivTree being the best-performing discretizer in the
majority of cases. We demonstrate that, while DP generative models with
non-private discretization remain vulnerable to membership inference attacks,
applying DP during discretization effectively mitigates this risk. Finally, we
propose an optimized approach for automatically selecting the number of bins,
which achieves high utility while reducing both privacy budget consumption and
computational overhead.