A Comparative Analysis of Data Synthesis Techniques to Improve Classification Accuracy of Raman Spectroscopy Data.

Journal: Journal of Chemical Information and Modeling
Published Date:

Abstract

Raman spectra are high-dimensional data that are often limited in the number of available samples. This is a primary concern when Deep Learning frameworks are developed for tasks such as chemical species identification, quantification, and diagnostics. Open-source data are difficult to obtain and often sparse; furthermore, collecting and curating new spectra requires expertise and resources. Deep generative modeling uses Deep Learning architectures to approximate high-dimensional distributions and aims to generate realistic synthetic data. Evaluation of the data and of the deep models' performance is usually conducted on a per-task basis and gives no indication of improved robustness or generalization on a wider scale. In this study, we compare the benefits and limitations of a standard statistical approach to data synthesis () with a popular deep generative model, the . Two binary data sets are divided into three folds to simulate small, limited samples. Synthetic data distributions are created per fold using the two methods and then added to the training data of two Deep Learning algorithms, a and a . The goal of this study is to observe the trends in learning as synthetic data are added to the training data in increasing batches. To determine the impact of each synthesis method, and the are implemented to visualize and measure the distance between the source and synthetic distributions, along with the Machine Learning metric for evaluating performance on imbalanced data.

Authors

  • Aaron R Flanagan
    School of Computer Science, University of Galway, Co. Galway H91 FYH2, Ireland.
  • Frank G Glavin
    School of Computer Science, National University of Ireland, Galway H91 TK33, Ireland.