Is replacing missing values of PM2.5 constituents with estimates using machine learning better for source apportionment than exclusion or median replacement?

Journal: Environmental pollution (Barking, Essex : 1987)
Published Date:

Abstract

East Asian countries have been conducting source apportionment of fine particulate matter (PM2.5) by applying positive matrix factorization (PMF) to hourly constituent concentrations. However, some of the constituent data from the supersites in South Korea were missing because of instrument maintenance and calibration. Conventional preprocessing of missing values, such as exclusion or median replacement, biases the estimated source contributions because it changes the PMF input. Machine learning (ML) can estimate the missing values by training on constituent data, meteorological data, and gaseous pollutants. We took complete data from the Seoul Supersite for 2018 and set a random 20% as missing. The missing values were estimated with a random forest, PMF was run on the data with the estimates substituted, and percent errors of the resulting source contributions were calculated relative to those estimated from the complete data. Estimation accuracy (r) was as high as 0.874 when carbon species were missing and as low as 0.631 when ionic species and trace elements were missing. For the seven highest contributing sources, replacing the missing values of carbon species with estimates minimized the percent errors, to 2.0% on average. However, replacing the missing values of the other chemical species with estimates increased the percent errors to more than 9.7% on average. Percent errors peaked at 37% on average when missing values of ionic species and trace elements were replaced with estimates. Missing values of species other than carbon should therefore be excluded; doing so reduced the percent errors to 7.4% on average, which was lower than the errors due to median replacement. Our results show that the biases in source apportionment can be reduced by replacing the missing values of carbon species with estimates. To reduce the biases due to missing values of the other chemical species, the estimation accuracy of the ML needs to be improved.
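The evaluation protocol described in the abstract (mask a random 20% of a complete record, apply a treatment for the gaps, and score the result by percent error against the complete data) can be illustrated with a minimal sketch. This is not the authors' code: the data are synthetic, the function names are hypothetical, only the two conventional treatments (exclusion and median replacement) are shown, and the mean concentration stands in for a PMF source contribution, since neither PMF nor the random forest is reproduced here.

```python
import random
import statistics


def mask_random(values, fraction, rng):
    """Return a copy of `values` with a random `fraction` set to None."""
    n_missing = round(len(values) * fraction)
    missing = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing else v for i, v in enumerate(values)]


def exclude_missing(values):
    """Exclusion: drop the missing entries."""
    return [v for v in values if v is not None]


def replace_with_median(values):
    """Median replacement: fill gaps with the median of the observed values."""
    median = statistics.median(exclude_missing(values))
    return [median if v is None else v for v in values]


def percent_error(estimate, reference):
    """Absolute percent error of `estimate` relative to `reference`."""
    return 100.0 * abs(estimate - reference) / abs(reference)


# Synthetic hourly concentrations (hypothetical, right-skewed; not Supersite data).
rng = random.Random(42)
complete = [rng.lognormvariate(1.0, 0.6) for _ in range(1000)]

# Set a random 20% of the complete record as missing.
masked = mask_random(complete, 0.20, rng)

# Assumption: the mean concentration stands in for a PMF source contribution.
reference = statistics.fmean(complete)
for name, treated in [("exclusion", exclude_missing(masked)),
                      ("median replacement", replace_with_median(masked))]:
    error = percent_error(statistics.fmean(treated), reference)
    print(f"{name}: {error:.2f}% error")
```

In a fuller version of this sketch, an ML estimate would be a third treatment alongside the two above, and the scored statistic would be the source contributions resolved by PMF rather than a simple mean.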

Authors

  • Youngkwon Kim
    Department of Environmental Health Sciences, Graduate School of Public Health, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, 08826, Republic of Korea.
  • Seung-Muk Yi
    Department of Environmental Health Sciences, Graduate School of Public Health, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, 08826, Republic of Korea; Institute of Health and Environment, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, 08826, Republic of Korea.
  • Jongbae Heo
    Busan Development Institute, Busan, 47210, Republic of Korea.
  • Hwajin Kim
    Department of Environmental Health Sciences, Graduate School of Public Health, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, 08826, Republic of Korea.
  • Woojoo Lee
    Department of Public Health Science, Graduate School of Public Health, Seoul National University, Seoul, Korea.
  • Ho Kim
    Graduate School of Public Health, Seoul National University, Seoul, South Korea.
  • Philip K Hopke
  • Young Su Lee
    Department of Energy and Environmental Engineering, Soonchunhyang University, Soonchunhyang-ro, Sinchang-myeon, Asan-si, Chungcheongnam-do, 31538, Republic of Korea.
  • Hye-Jung Shin
    Air Quality Research Division, Department of Climate and Air Quality Research, National Institute of Environmental Research, Incheon, 22689, Republic of Korea.
  • Jungmin Park
    Air Quality Research Division, Department of Climate and Air Quality Research, National Institute of Environmental Research, Incheon, 22689, Republic of Korea.
  • Myungsoo Yoo
    Department of Climate and Air Quality Research, National Institute of Environmental Research, Incheon, 22689, Republic of Korea.
  • Kwonho Jeon
    Global Environment Research Division, Department of Climate and Air Quality Research, National Institute of Environmental Research, Incheon, 22689, Republic of Korea.
  • Jieun Park
    Department of Environmental Health, Harvard T.H. Chan School of Public Health, 401 Park Drive, Boston, MA, 02215, USA. Electronic address: jieun_park@hsph.harvard.edu.