Addressing data handling shortcomings in machine learning studies on biochar for heavy metal remediation.

Journal: Journal of Hazardous Materials
Published Date:

Abstract

Recent advances in machine learning (ML) have expanded its use in environmental science, particularly in soil and water remediation. This paper reviews recent studies that employ ML to optimize the use of biochar for heavy metal adsorption. It highlights critical data handling shortcomings, specifically data leakage and improper splitting of data sets, which can undermine the reliability and generalizability of research findings, and it emphasizes the need for rigorous data management practices. Data in this context are compiled from multiple experimental studies and are typically grouped by experimental conditions and biochar type. Because data points within a group share characteristics and experimental conditions, they are not independent of one another. The paper discusses methodologies to enhance data integrity and improve the representativeness of ML applications in environmental science. Through these discussions, it aims to guide future research toward more robust, reliable, and applicable ML-driven strategies for environmental remediation.
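
The grouping issue described above can be illustrated with a minimal sketch of a group-aware train/test split, in which all rows from one group (for example, one source study or one biochar type) fall entirely in the training set or entirely in the test set. This is not the authors' procedure. The function name `group_train_test_split`, the group labels, and the split fraction are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def group_train_test_split(groups, test_fraction=0.2, seed=0):
    """Split row indices so that each group lands wholly in train or test.

    `groups` holds one label per row (hypothetical: a source-study or
    biochar identifier). Rows that share a label are never separated.
    """
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)

    if len(by_group) < 2:
        raise ValueError("need at least two groups to form a split")

    # Shuffle the group labels, not the individual rows.
    group_ids = sorted(by_group)
    random.Random(seed).shuffle(group_ids)

    # Hold out at least one group, and keep at least one for training.
    n_test = round(len(group_ids) * test_fraction)
    n_test = min(max(1, n_test), len(group_ids) - 1)

    test_idx = sorted(i for g in group_ids[:n_test] for i in by_group[g])
    train_idx = sorted(i for g in group_ids[n_test:] for i in by_group[g])
    return train_idx, test_idx


# Hypothetical data: ten adsorption measurements from five source studies.
groups = ["study_A", "study_A", "study_A", "study_B", "study_B",
          "study_C", "study_C", "study_C", "study_D", "study_E"]

train_idx, test_idx = group_train_test_split(groups, test_fraction=0.4, seed=42)
train_groups = {groups[i] for i in train_idx}
test_groups = {groups[i] for i in test_idx}

# No group appears on both sides, so correlated rows cannot leak across.
print("shared groups:", train_groups & test_groups)
```

A purely random row-level split would instead scatter measurements from the same study across both sets, which lets a model score well by recognizing the study rather than by generalizing to unseen biochars or conditions. Where scikit-learn is available, its `GroupKFold` and `GroupShuffleSplit` classes offer the same kind of group-aware splitting.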

Authors

  • Destika Cahyana
    Research Center for Food Crops, Research Organization for Agriculture and Food, National Research and Innovation Agency (BRIN), Indonesia. Electronic address: destika.cahyana@brin.go.id.
  • Ho Jun Jang
    Sydney Institute of Agriculture, The University of Sydney, NSW 2006, Australia.