Synthetic Patient Data Generation and Evaluation in Disease Prediction Using Small and Imbalanced Datasets.

Journal: IEEE Journal of Biomedical and Health Informatics

Abstract

The increasing prevalence of chronic non-communicable diseases makes it a priority to develop tools that improve their management. In this context, Artificial Intelligence algorithms have proven successful in early diagnosis, prediction and analysis in the medical field. Nonetheless, two main issues arise when dealing with medical data: the lack of high-fidelity datasets and the need to preserve patient privacy. Synthetic data generation techniques have emerged as a possible solution to these problems. In this work, a framework based on synthetic data generation algorithms was developed and tested on eight medical datasets containing tabular data. Three statistical metrics were used to analyze the preservation of data integrity in the synthetic data, and six synthetic data generation sizes were tested. In addition, the generated synthetic datasets were used to train four supervised Machine Learning classifiers, both alone and combined with the real data. The F1-score was used to evaluate classification performance. The main goal of this work is to assess the feasibility of synthetic data generation for medical data in two respects: preservation of data integrity and maintenance of classification performance.
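The evaluation idea described in the abstract (train on real data, on synthetic data alone, and on both combined, then compare F1-scores on held-out real data) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy two-feature dataset, the per-class Gaussian sampler used as a stand-in generator, the nearest-centroid stand-in classifier, and all function names are hypothetical. The abstract does not name the generation algorithms, the statistical metrics, or the classifiers actually used.

```python
import random
import statistics


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def make_toy_data(rng, n_neg, n_pos):
    """Hypothetical small, imbalanced dataset (NOT one of the paper's datasets)."""
    rows = [([rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)], 0) for _ in range(n_neg)]
    rows += [([rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)], 1) for _ in range(n_pos)]
    rng.shuffle(rows)
    return rows


def generate_synthetic(rows, n_per_class, rng):
    """Stand-in generator (assumption): independent per-class Gaussians per feature."""
    synthetic = []
    for label in sorted({y for _, y in rows}):
        columns = list(zip(*[x for x, y in rows if y == label]))
        means = [statistics.mean(c) for c in columns]
        stds = [statistics.stdev(c) for c in columns]
        for _ in range(n_per_class):
            sample = [rng.gauss(m, s) for m, s in zip(means, stds)]
            synthetic.append((sample, label))
    return synthetic


def fit_centroids(rows):
    """Stand-in classifier (assumption): one mean vector per class."""
    centroids = {}
    for label in sorted({y for _, y in rows}):
        feats = [x for x, y in rows if y == label]
        centroids[label] = [statistics.mean(c) for c in zip(*feats)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def evaluate(train_rows, test_rows):
    """Train on train_rows, report F1 on test_rows (always real data here)."""
    centroids = fit_centroids(train_rows)
    y_true = [y for _, y in test_rows]
    y_pred = [predict(centroids, x) for x, _ in test_rows]
    return f1_score(y_true, y_pred)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
real_train = make_toy_data(rng, n_neg=40, n_pos=8)
real_test = make_toy_data(rng, n_neg=40, n_pos=8)
# Generating the same number of samples per class also balances the classes.
synthetic = generate_synthetic(real_train, n_per_class=40, rng=rng)

results = {
    "real only": evaluate(real_train, real_test),
    "synthetic only": evaluate(synthetic, real_test),
    "real + synthetic": evaluate(real_train + synthetic, real_test),
}
for name, score in results.items():
    print(f"{name}: F1 = {score:.3f}")
```

In this sketch the test set is always real data, so a "synthetic only" score close to the "real only" score would suggest that the synthetic samples retain the class structure needed for prediction. Varying `n_per_class` would mimic testing several generation sizes.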

Authors

  • Antonio J Rodriguez-Almeida
  • Himar Fabelo
    Institute for Applied Microelectronics (IUMA), University of Las Palmas de Gran Canaria (ULPGC), Campus de Tafira, 35017 Las Palmas, Spain. hfabelo@iuma.ulpgc.es.
  • Samuel Ortega
    Institute for Applied Microelectronics (IUMA), University of Las Palmas de Gran Canaria (ULPGC), Campus de Tafira, 35017 Las Palmas, Spain. sortega@iuma.ulpgc.es.
  • Alejandro Deniz
  • Francisco J Balea-Fernandez
  • Eduardo Quevedo
  • Cristina Soguero-Ruiz
    Department of Signal Theory and Communications, Telematics and Computing Systems, Rey Juan Carlos University, Madrid, Spain. Electronic address: cristina.soguero@urjc.es.
  • Ana M Wägner
    Instituto Universitario de Investigaciones Biomédicas y Sanitarias, Universidad de Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain. Electronic address: ana.wagner@ulpgc.es.
  • Gustavo M Callico
    Institute for Applied Microelectronics (IUMA), University of Las Palmas de Gran Canaria (ULPGC), Campus de Tafira, 35017 Las Palmas, Spain. gustavo@iuma.ulpgc.es.