Privacy-Preserving Model Training for Disease Prediction Using Federated Learning with Differential Privacy

Journal: Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)

Abstract

Machine learning plays an increasingly critical role in health science through its capability to infer valuable information from high-dimensional data. More training data provides greater statistical power to build better models that can support decision-making in healthcare. However, this often requires combining research and patient data across institutions and hospitals, which is not always possible due to privacy considerations. In this paper, we outline a simple federated learning algorithm that implements differential privacy to preserve privacy when training a machine learning model on data spread across different institutions. We tested our approach by predicting breast cancer status from gene expression data. With privacy enforced, our model achieves accuracy and precision similar to those of a non-private neural network trained at a single site. This result suggests that our algorithm is an effective method of combining differential privacy with federated learning, and clinical data scientists can use our general framework to produce differentially private models on federated datasets. Our framework is available at https://github.com/gersteinlab/idash20FL.
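The abstract does not spell out the algorithm itself. As an illustration only, a common way to combine federated learning with differential privacy is for each site to compute a local model update, clip its L2 norm, and add Gaussian noise before the server averages the updates (the Gaussian mechanism applied to federated averaging). The sketch below uses logistic regression on synthetic data; all function names, the learning rate, the clipping bound, and the noise scale are illustrative assumptions, not the authors' implementation, and the noise scale would in practice be calibrated to a target (epsilon, delta) privacy budget.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(w, X, y, lr=0.1):
    # One gradient-descent step of logistic regression on a site's local data;
    # returns the proposed change to the shared weights (not the weights themselves).
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (p - y) / len(y)
    return -lr * grad

def privatize(update, clip=1.0, sigma=0.1):
    # Gaussian mechanism: bound each site's influence by clipping the update's
    # L2 norm to `clip`, then add Gaussian noise scaled to that bound.
    # `sigma` is illustrative; a real deployment calibrates it to (epsilon, delta).
    norm = np.linalg.norm(update)
    update = update * min(1.0, clip / max(norm, 1e-12))
    return update + rng.normal(0.0, sigma * clip, size=update.shape)

def federated_round(w, sites):
    # Each site privatizes its own update locally; the server only sees
    # the noisy, clipped updates and averages them into the global model.
    updates = [privatize(local_update(w, X, y)) for X, y in sites]
    return w + np.mean(updates, axis=0)

def make_site(n, d=5):
    # Toy stand-in for one institution's data, sharing a common linear signal.
    X = rng.normal(size=(n, d))
    y = (X @ np.ones(d) > 0).astype(float)
    return X, y

sites = [make_site(200), make_site(200)]  # two simulated institutions
w = np.zeros(5)
for _ in range(50):
    w = federated_round(w, sites)
```

Because clipping and noising happen at each site before anything is shared, no raw patient data or exact gradients ever leave an institution, which is the property the paper's framework is built around.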

Authors

  • Amol Khanna
  • Vincent Schaffer
  • Gamze Gürsoy
    Department of Biomedical Informatics, Department of Computer Science, Columbia University, New York Genome Center, New York, NY, USA.
  • Mark Gerstein
    Program of Computational Biology and Bioinformatics and Department of Molecular Biophysics and Biochemistry and Department of Computer Science, Yale University, New Haven, CT 06511, USA.