The urgent need to accelerate synthetic data privacy frameworks for medical research.

Journal: The Lancet. Digital health
PMID:

Abstract

Synthetic data, generated by artificial intelligence technologies such as generative adversarial networks and latent diffusion models, preserve the aggregate patterns and relationships present in the real data on which the generative models were trained without exposing individual identities, thereby mitigating re-identification risks. This approach has been gaining traction in biomedical research because it preserves privacy and enables dataset sharing between organisations. Although the use of synthetic data is already widespread in other domains, such as finance and high-energy physics, its use in medical research raises novel issues. For synthetic data to serve as a privacy-preserving substitute for the data used to train the generative models, they must have sufficiently high fidelity to the original data to retain utility, yet be sufficiently different to protect against adversarial or accidental re-identification. Standards for synthetic data generation, and consensus standards for its evaluation, need to be developed. As synthetic data applications expand, ongoing legal and ethical evaluation is crucial to ensure that synthetic data remain a secure and effective tool for advancing medical research without compromising individual privacy.

Authors

  • Anmol Arora
    School of Clinical Medicine, University of Cambridge, Cambridge, UK.
  • Siegfried Karl Wagner
    NIHR Biomedical Research Centre, Moorfields Eye Hospital NHS Foundation Trust, London, UK; Institute of Ophthalmology, University College London, London, UK.
  • Robin Carpenter
    King's College London, London, UK.
  • Rajesh Jena
Health Intelligence, Microsoft Research, Cambridge, UK.
  • Pearse A Keane
    National Institute for Health Research Biomedical Research Centre for Ophthalmology, Moorfields Eye Hospital NHS Foundation Trust and UCL Institute of Ophthalmology, London, UK.