A Systematic Study of Popular Software Packages and AI/ML Models for Calibrating In Situ Air Quality Data: An Example with Purple Air Sensors.

Journal: Sensors (Basel, Switzerland)
PMID:

Abstract

Accurate air pollution monitoring is critical to understanding and mitigating the impacts of air pollution on human health and ecosystems. Because advanced, highly accurate sensors for monitoring air pollutants are limited in number and geographical coverage, many low-cost, lower-accuracy sensors have been deployed. Calibrating these low-cost sensors is essential to filling the geographical gaps in sensor coverage. We systematically examined how different machine learning (ML) models and open-source packages could help improve the accuracy of fine particulate matter (PM2.5) data collected by Purple Air sensors. Eleven ML models and five packages were examined. This systematic study found that both models and packages affected accuracy, while the random training/testing split ratio (e.g., 80/20 vs. 70/30) had minimal impact (0.745% difference in R²). Long Short-Term Memory (LSTM) models trained in RStudio and TensorFlow performed best, with high R² scores of 0.856 and 0.857 and low Root Mean Squared Errors (RMSEs) of 4.25 µg/m³ and 4.26 µg/m³, respectively. However, LSTM models may be too slow (1.5 h) or too computation-intensive for applications with fast response requirements. Tree-based ensemble models, including XGBoost (R² = 0.7612, RMSE = 5.377 µg/m³) in RStudio and Random Forest (RF) (R² = 0.7632, RMSE = 5.366 µg/m³) in TensorFlow, offered good performance with much shorter training times (<1 min) and may be suitable for such applications. These findings suggest that AI/ML models, particularly LSTM models, can effectively calibrate low-cost sensors to produce precise, localized air quality data. This research is among the most comprehensive studies of AI/ML for air pollutant calibration. We also discuss limitations, applicability to other sensors, and possible explanations for the strong model performance. This research can be adapted to enhance air quality monitoring for public health risk assessments, support broader environmental health initiatives, and inform policy decisions.
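
To make the evaluation metrics concrete, the following is a minimal sketch of how R² and RMSE could be computed for a calibration model under two random training/testing split ratios (80/20 and 70/30). It is not the study's pipeline: the data are synthetic, the feature names (raw sensor reading, humidity) are assumptions, and a plain least-squares linear model stands in for the eleven ML models examined. All function names are hypothetical, and the numbers it prints are not results from the paper.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Squared Error, in the same units as y (e.g. µg/m³)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def random_split(n_samples, test_fraction, rng):
    """Return (train_idx, test_idx) for a random split."""
    idx = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return idx[n_test:], idx[:n_test]


def fit_linear_calibration(X, y):
    """Least-squares fit of y ~ X @ w + b (stand-in for an ML model)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef


def evaluate_split(X, y, test_fraction, seed=0):
    """Train on the training part, report (R², RMSE) on the held-out part."""
    rng = np.random.default_rng(seed)
    train, test = random_split(len(y), test_fraction, rng)
    coef = fit_linear_calibration(X[train], y[train])
    y_hat = predict(coef, X[test])
    return r_squared(y[test], y_hat), rmse(y[test], y_hat)


def make_synthetic_data(n=2000, seed=42):
    """Synthetic data only: a 'reference' PM2.5 series and a biased, noisy
    low-cost reading. The bias and noise levels are invented for illustration."""
    rng = np.random.default_rng(seed)
    reference = rng.gamma(shape=2.0, scale=6.0, size=n)
    humidity = rng.uniform(20.0, 90.0, size=n)
    raw = 1.6 * reference + 0.05 * humidity + rng.normal(0.0, 3.0, size=n)
    return np.column_stack([raw, humidity]), reference


if __name__ == "__main__":
    X, y = make_synthetic_data()
    for frac in (0.2, 0.3):
        r2, err = evaluate_split(X, y, frac)
        print(f"test fraction {frac:.1f}: R2={r2:.3f}, RMSE={err:.3f}")
```

Comparing the two (R², RMSE) pairs returned by evaluate_split mirrors, in miniature, the abstract's comparison of split ratios; with a real dataset the synthetic generator would be replaced by co-located low-cost and reference measurements.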

Authors

  • Seren Smith
    NSF Spatiotemporal Innovation Center, George Mason University, 4400 University Dr., Fairfax, VA 22030, USA.
  • Theodore Trefonides
    NSF Spatiotemporal Innovation Center, George Mason University, 4400 University Dr., Fairfax, VA 22030, USA.
  • Anusha Srirenganathan Malarvizhi
    NSF Spatiotemporal Innovation Center, George Mason University, 4400 University Dr., Fairfax, VA 22030, USA.
  • Shyra LaGarde
    NSF Spatiotemporal Innovation Center, George Mason University, 4400 University Dr., Fairfax, VA 22030, USA.
  • Jiakang Liu
    NSF Spatiotemporal Innovation Center, George Mason University, 4400 University Dr., Fairfax, VA 22030, USA.
  • Xiaoguo Jia
    NSF Spatiotemporal Innovation Center, George Mason University, 4400 University Dr., Fairfax, VA 22030, USA.
  • Zifu Wang
    Department of Geography and Geoinformation Science, NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, USA. zwang31@gmu.edu.
  • Jacob Cain
    NSF Spatiotemporal Innovation Center, George Mason University, 4400 University Dr., Fairfax, VA 22030, USA.
  • Thomas Huang
    Department of Emergency Medicine, Yale School of Medicine, New Haven, CT, United States; Department for Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, United States.
  • Mohammad Pourhomayoun
    Department of Computer Science, California State University, 1250 Bellflower Blvd, Long Beach, CA 90840, USA.
  • Grace Llewellyn
    NASA Jet Propulsion Laboratory, 4800 Oak Grove Dr., Pasadena, CA 91011, USA.
  • Wai Phyo
    NASA Jet Propulsion Laboratory, 4800 Oak Grove Dr., Pasadena, CA 91011, USA.
  • Sina Hasheminassab
    NASA Jet Propulsion Laboratory, 4800 Oak Grove Dr., Pasadena, CA 91011, USA.
  • Joe Roberts
    NASA Jet Propulsion Laboratory, 4800 Oak Grove Dr., Pasadena, CA 91011, USA.
  • Kevin Marlis
    NASA Jet Propulsion Laboratory, 4800 Oak Grove Dr., Pasadena, CA 91011, USA.
  • Daniel Q Duffy
    NASA Goddard Space Flight Center, Greenbelt, MD 20771, USA.
  • Chaowei Yang
    Department of Geography and Geoinformation Science, NSF Spatiotemporal Innovation Center, George Mason University, Fairfax, USA.