Robust Multiclass Feature Selection for the Authentication of Honey Botanical Origin via Nontargeted LC-MS Analysis.

Journal: Analytical Chemistry
Published Date:

Abstract

Honey is one of the foods most frequently targeted by fraud because certain monofloral honeys command high market prices. Traditional authentication methods based on pollen or targeted analysis have limitations that fraudsters can exploit. Nontargeted analysis of honey via liquid chromatography-mass spectrometry (LC-MS) can provide data on thousands of chemical features. However, most studies that train machine learning models for food authentication have sample sizes in the tens or hundreds, which introduces the problem of overfitting when the feature-to-sample ratio is this large. Herein, a recursive feature elimination (RFE) pipeline was developed specifically to address the challenges of optimizing the honey chemical fingerprint for multiclass machine learning classifiers trained on a limited number of samples with imperfect labels. A support vector machine was used for both RFE and classification to reduce the 2028 nontargeted features to just 54 (a 97.3% reduction) without any loss of classification performance. The resulting model was a 6-class classifier capable of identifying monofloral blueberry, buckwheat, clover, goldenrod, linden, or other honey with a nested cross-validation Matthews correlation coefficient (MCC) of 0.803 ± 0.046. The development of a k-nearest neighbors filter and the decision to continue the RFE process beyond the iteration with the highest classification score were instrumental in achieving this outcome. This work presents a complete pipeline that automates feature selection from nontargeted LC-MS spectra when working with a limited number of samples and imperfect labels. The process can also be extended to other food groups and spectral data.
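
For readers unfamiliar with the reported metric, the following is a minimal sketch of a multiclass MCC computation together with the feature-reduction arithmetic quoted above. It assumes the common multiclass generalization of the MCC (the one based on the confusion-matrix covariance); the abstract does not state which formulation the authors used. The function names multiclass_mcc and percent_reduction are hypothetical and do not come from the paper.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient.

    Assumption: uses the widely used multiclass generalization
        MCC = (c*s - sum_k p_k*t_k)
              / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    where s is the number of samples, c the number of correct
    predictions, t_k the true count of class k, and p_k the
    predicted count of class k.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_counts = Counter(y_true)  # how often each class truly occurs
    p_counts = Counter(y_pred)  # how often each class is predicted
    labels = set(t_counts) | set(p_counts)

    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    var_true = s * s - sum(v * v for v in t_counts.values())
    var_pred = s * s - sum(v * v for v in p_counts.values())
    denominator = sqrt(var_true * var_pred)

    # By convention, return 0.0 when the score is undefined
    # (for example, when every prediction is the same class).
    return numerator / denominator if denominator else 0.0


def percent_reduction(n_before, n_after):
    """Percentage of features removed by the selection step."""
    return 100.0 * (n_before - n_after) / n_before


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the study.
    truth = ["clover", "clover", "linden", "linden", "other", "other"]
    guess = ["clover", "linden", "linden", "linden", "other", "other"]
    print(f"MCC on toy labels: {multiclass_mcc(truth, guess):.3f}")
    print(f"Feature reduction: {percent_reduction(2028, 54):.1f}%")
```

With the feature counts from the abstract, percent_reduction(2028, 54) rounds to 97.3, matching the stated reduction. An MCC of 1 indicates perfect agreement and a value near 0 indicates chance-level prediction, which is why the metric is often preferred over accuracy when classes are imbalanced.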

Authors

  • Shawninder Chahal
    Department of Food Science and Agricultural Chemistry, McGill University, 21111 Lakeshore Rd, Sainte-Anne-de-Bellevue, Quebec H9X 3V9, Canada.
  • Lei Tian
    Department of Electrical and Computer Engineering, Boston University, 8 St. Mary's Street, RM 830, Boston, Massachusetts, 02215.
  • Shaghig Bilamjian
    Department of Food Science and Agricultural Chemistry, McGill University, 21111 Lakeshore Rd, Sainte-Anne-de-Bellevue, Quebec H9X 3V9, Canada.
  • Ferenc Balogh
    Department of Mathematics, John Abbott College, 21275 Lakeshore Rd, Sainte-Anne-de-Bellevue, Quebec H9X 3L9, Canada.
  • Lorna De Leoz
    Agilent CrossLab Group, Agilent Technologies, 5301 Stevens Creek Blvd, Santa Clara, California 95051, United States.
  • Tarun Anumol
    Agilent CrossLab Group, Agilent Technologies, 5301 Stevens Creek Blvd, Santa Clara, California 95051, United States.
  • Daniel Cuthbertson
    Agilent CrossLab Group, Agilent Technologies, 5301 Stevens Creek Blvd, Santa Clara, California 95051, United States.
  • Stéphane Bayen
    Department of Food Science and Agricultural Chemistry, McGill University, 21111 Lakeshore Rd, Sainte-Anne-de-Bellevue, Quebec H9X 3V9, Canada.