TY - JOUR
T1 - Use of data imputation tools to reconstruct incomplete air quality datasets
T2 - A case-study in Temuco, Chile
AU - Quinteros, María Elisa
AU - Lu, Siyao
AU - Blazquez, Carola
AU - Cárdenas-R, Juan Pablo
AU - Ossa, Ximena
AU - Delgado-Saborit, Juana María
AU - Harrison, Roy M.
AU - Ruiz-Rudolph, Pablo
N1 - Funding Information:
This work was supported as part of the project: “Impact of Wood Burning Air Pollution on Preeclampsia and other Pregnancy Outcomes in Temuco, Chile” (DPI20140093) by CONICYT and Research Councils UK. Juana Maria Delgado-Saborit is supported by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 750531. María Elisa Quinteros was supported by a doctoral scholarship by CONICYT Beca Doctorado Nacional No 21150801, Chile. We acknowledge Xavier Basagaña for his technical help, Payam Dadvand for his intellectual assistance, Gloria Icaza Noguera for reviewing the manuscript, and Estela Blanco for her help in reviewing English writing of the article.
Publisher Copyright:
© 2018 Elsevier Ltd
PY - 2019/3/1
Y1 - 2019/3/1
AB - Missing data from air quality datasets is a common problem, but is much more severe in small cities or localities. This poses a great challenge for environmental epidemiology as high exposures to pollutants worldwide occur in these settings and gaps in datasets hinder health studies that could later inform local and international policies. Here, we propose the use of imputation methods as a tool to reconstruct air quality datasets and have applied this approach to an air quality dataset in Temuco, a mid-size city in Chile, as a case study. We attempted to reconstruct the database comparing five approaches: mean imputation, conditional mean imputation, K-Nearest Neighbor imputation, multiple imputation, and Bayesian Principal Component Analysis imputation. As a base for the imputation methods, linear regression models were fitted for PM2.5 against other air quality and meteorological variables. Methods were challenged against validation sets where data were removed artificially. Imputation methods were able to reconstruct the dataset with good performance in terms of completeness, errors, and bias, even when challenged against the validation sets. The performance improved when including covariates from a second monitoring station in Temuco. K-Nearest Neighbor imputation showed slightly better performance than multiple imputation for error (25% vs. 27%) and bias (2.1% vs. 3.9%), but presented lower completeness (70% vs. 100%). In summary, our results show that imputation methods can be a useful tool in reconstructing air quality datasets in a real-life situation.
KW - Air pollution
KW - Environmental epidemiology
KW - Missing data
KW - Multiple imputation
KW - Single imputation
KW - Wood-burning
UR - http://www.scopus.com/inward/record.url?scp=85058415433&partnerID=8YFLogxK
U2 - 10.1016/j.atmosenv.2018.11.053
DO - 10.1016/j.atmosenv.2018.11.053
M3 - Article
AN - SCOPUS:85058415433
SN - 1352-2310
VL - 200
SP - 40
EP - 49
JO - Atmospheric Environment
JF - Atmospheric Environment
ER -