Data-driven, explainable machine learning model for predicting volatile organic compounds' standard vaporization enthalpy.

Journal: Chemosphere
PMID:

Abstract

The accurate prediction of standard vaporization enthalpy (ΔH°) for volatile organic compounds (VOCs) is of paramount importance in environmental chemistry, industrial applications and regulatory compliance. As an alternative to traditional experimental methods for determining ΔH° of VOCs, machine learning (ML) models enable high-throughput, cost-effective property estimation. Despite rising momentum, however, existing ML algorithms still show limitations in prediction accuracy and in the breadth of their chemical applicability. In this work, we present a data-driven, explainable supervised ML model to predict ΔH° of VOCs. The model was built on an established experimental database of 2410 unique molecules and 223 VOCs categorized by chemical groups. Using supervised ML regression, a Random Forest model predicted VOCs' ΔH° with a mean absolute error of 3.02 kJ mol⁻¹ and a 95% test score. The model was validated by predicting ΔH° for a known database of VOCs and through molecular group hold-out tests. Through chemical feature importance analysis, this explainable model revealed that VOC polarizability, connectivity indexes and electrotopological state are key to the model's prediction accuracy. We thus present a replicable and explainable model, which can be further expanded towards the prediction of other thermodynamic properties of VOCs.
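
The two evaluation ideas named in the abstract, mean absolute error and molecular group hold-out tests, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the chemical group labels and the enthalpy values are hypothetical placeholders, and a simple "predict the training mean" baseline stands in for the Random Forest so that the example needs only the Python standard library.

```python
from collections import defaultdict


def mean_absolute_error(y_true, y_pred):
    """Average of |observed - predicted|, in the units of y (e.g. kJ/mol)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def group_holdout_splits(groups):
    """Yield (held_out_group, train_idx, test_idx), withholding one group at a time."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    for g, test_idx in by_group.items():
        train_idx = [i for i in range(len(groups)) if groups[i] != g]
        yield g, train_idx, test_idx


def holdout_mae(values, groups):
    """MAE per held-out group, using the training-set mean as the predictor.

    The mean predictor is only a stand-in (an assumption of this sketch);
    any regressor trained on train_idx could be substituted here.
    """
    scores = {}
    for g, train_idx, test_idx in group_holdout_splits(groups):
        train_mean = sum(values[i] for i in train_idx) / len(train_idx)
        y_true = [values[i] for i in test_idx]
        y_pred = [train_mean] * len(test_idx)
        scores[g] = mean_absolute_error(y_true, y_pred)
    return scores


# Made-up placeholder data, NOT taken from the paper's database.
groups = ["alkane", "alkane", "alcohol", "alcohol", "ketone"]
values = [30.0, 32.0, 40.0, 44.0, 35.0]  # illustrative enthalpies in kJ/mol

scores = holdout_mae(values, groups)
for group, mae in scores.items():
    print(f"held-out group {group}: MAE = {mae:.2f} kJ/mol")
```

Withholding an entire chemical group, rather than a random subset of molecules, tests whether a model extrapolates to compound classes it never saw during training, which is a stricter check than a random train/test split.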

Authors

  • José Ferraz-Caetano
    Department of Chemistry and Biochemistry - Faculty of Sciences, University of Porto - Rua do Campo Alegre, S/N, 4169-007 Porto, Portugal.
  • Filipe Teixeira
    Centre of Chemistry, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal.
  • M Natália D S Cordeiro