Contrastive Learning-Based privacy metrics in Tabular Synthetic Datasets
Journal:
arXiv
Published Date:
Feb 19, 2025
Abstract
Synthetic data has garnered attention as a Privacy Enhancing Technology (PET)
in sectors such as healthcare and finance. When using synthetic data in
practical applications, it is important to provide protection guarantees. In
the literature, two family of approaches are proposed for tabular data: on the
one hand, Similarity-based methods aim at finding the level of similarity
between training and synthetic data. Indeed, a privacy breach can occur if the
generated data is consistently too similar or even identical to the train data.
On the other hand, Attack-based methods conduce deliberate attacks on synthetic
datasets. The success rates of these attacks reveal how secure the synthetic
datasets are.
In this paper, we introduce a contrastive method that improves privacy
assessment of synthetic datasets by embedding the data in a more representative
space. This overcomes obstacles surrounding the multitude of data types and
attributes. It also makes the use of intuitive distance metrics possible for
similarity measurements and as an attack vector. In a series of experiments
with publicly available datasets, we compare the performances of
similarity-based and attack-based methods, both with and without use of the
contrastive learning-based embeddings. Our results show that relatively
efficient, easy to implement privacy metrics can perform equally well as more
advanced metrics explicitly modeling conditions for privacy referred to by the
GDPR.