Use of data imputation tools to reconstruct incomplete air quality datasets: A case-study in Temuco, Chile

María Elisa Quinteros, Siyao Lu, Carola Blazquez, Juan Pablo Cárdenas-R, Ximena Ossa, Juana María Delgado-Saborit, Roy M. Harrison, Pablo Ruiz-Rudolph

Research output: Article

2 Citations (Scopus)

Abstract

Missing data in air quality datasets are a common problem, and one that is much more severe in small cities or localities. This poses a great challenge for environmental epidemiology, as high exposures to pollutants worldwide occur in these settings and gaps in datasets hinder health studies that could later inform local and international policies. Here, we propose the use of imputation methods as a tool to reconstruct air quality datasets, and apply this approach to an air quality dataset from Temuco, a mid-size city in Chile, as a case study. We attempted to reconstruct the database by comparing five approaches: mean imputation, conditional mean imputation, K-Nearest Neighbor imputation, multiple imputation, and Bayesian Principal Component Analysis imputation. As a basis for the imputation methods, linear regression models were fitted for PM2.5 against other air quality and meteorological variables. Methods were challenged against validation sets from which data were removed artificially. The imputation methods were able to reconstruct the dataset with good performance in terms of completeness, error, and bias, even when challenged against the validation sets. Performance improved when covariates from a second monitoring station in Temuco were included. K-Nearest Neighbor imputation showed slightly better performance than multiple imputation for error (25% vs. 27%) and bias (2.1% vs. 3.9%), but presented lower completeness (70% vs. 100%). In summary, our results show that imputation methods can be a useful tool for reconstructing air quality datasets in a real-life situation.
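The validation idea in the abstract (remove known values artificially, impute them, then compare against the truth) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the synthetic data, the single covariate standing in for a second monitoring station, the names `mean_impute`, `conditional_mean_impute` and `validate`, and the definitions of relative error and bias. Only two of the five methods (mean and regression-based conditional mean imputation) are shown.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mean_impute(target, covariate):
    """Fill every gap with the mean of the observed target values."""
    fill = mean([t for t in target if t is not None])
    return [fill if t is None else t for t in target]


def conditional_mean_impute(target, covariate):
    """Fill gaps with the prediction of a linear regression on the covariate."""
    pairs = [(c, t) for c, t in zip(covariate, target) if t is not None]
    a, b = fit_line([c for c, _ in pairs], [t for _, t in pairs])
    return [a + b * c if t is None else t for c, t in zip(covariate, target)]


def validate(impute, target, covariate, frac=0.2, seed=0):
    """Hide a fraction of the known values, impute, and score the result.

    Returns (relative error, relative bias). These metric definitions are
    assumptions for this sketch, not necessarily those used in the paper.
    """
    rng = random.Random(seed)
    known = [i for i, t in enumerate(target) if t is not None]
    held_out = rng.sample(known, int(frac * len(known)))
    masked = list(target)
    for i in held_out:
        masked[i] = None  # artificial removal
    filled = impute(masked, covariate)
    truth = [target[i] for i in held_out]
    pred = [filled[i] for i in held_out]
    scale = mean(truth)
    error = mean([abs(p - t) for p, t in zip(pred, truth)]) / scale
    bias = mean([p - t for p, t in zip(pred, truth)]) / scale
    return error, bias


# Synthetic PM2.5-like series (hypothetical): station A depends linearly
# on station B plus noise, and about 15% of station A is missing.
_rng = random.Random(42)
station_b = [_rng.uniform(10, 120) for _ in range(300)]
station_a = [5 + 0.9 * b + _rng.gauss(0, 5) for b in station_b]
for _i in _rng.sample(range(300), 45):
    station_a[_i] = None

if __name__ == "__main__":
    for name, method in [("mean", mean_impute),
                         ("conditional mean", conditional_mean_impute)]:
        err, bias = validate(method, station_a, station_b)
        print(f"{name}: error={err:.1%}, bias={bias:.1%}")
```

On data like these, where the target is strongly related to the covariate, the regression-based fill should score a lower error than the plain mean, which is consistent with the abstract's observation that adding covariates from a second station improved performance.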

Original language: English
Pages (from-to): 40-49
Number of pages: 10
Journal: Atmospheric Environment
Volume: 200
DOI: https://doi.org/10.1016/j.atmosenv.2018.11.053
Status: Published - 1 Mar 2019

Fingerprint

air quality
epidemiology
principal component analysis
method
pollutant
city

ASJC Scopus subject areas

  • Environmental Science(all)
  • Atmospheric Science

Cite this

Quinteros, María Elisa; Lu, Siyao; Blazquez, Carola; Cárdenas-R, Juan Pablo; Ossa, Ximena; Delgado-Saborit, Juana María; Harrison, Roy M.; Ruiz-Rudolph, Pablo. / Use of data imputation tools to reconstruct incomplete air quality datasets: A case-study in Temuco, Chile. In: Atmospheric Environment. 2019; Vol. 200. pp. 40-49.
@article{e3328657f6ea4cc18037d14044b656c5,
title = "Use of data imputation tools to reconstruct incomplete air quality datasets: A case-study in Temuco, Chile",
keywords = "Air pollution, Environmental epidemiology, Missing data, Multiple imputation, Single imputation, Wood-burning",
author = "Quinteros, {Mar{\'i}a Elisa} and Siyao Lu and Carola Blazquez and C{\'a}rdenas-R, {Juan Pablo} and Ximena Ossa and Delgado-Saborit, {Juana Mar{\'i}a} and Harrison, {Roy M.} and Pablo Ruiz-Rudolph",
year = "2019",
month = "3",
day = "1",
doi = "10.1016/j.atmosenv.2018.11.053",
language = "English",
volume = "200",
pages = "40--49",
journal = "Atmospheric Environment",
issn = "1352-2310",
publisher = "Elsevier Limited",

}

Quinteros, ME, Lu, S, Blazquez, C, Cárdenas-R, JP, Ossa, X, Delgado-Saborit, JM, Harrison, RM & Ruiz-Rudolph, P 2019, 'Use of data imputation tools to reconstruct incomplete air quality datasets: A case-study in Temuco, Chile', Atmospheric Environment, vol. 200, pp. 40-49. https://doi.org/10.1016/j.atmosenv.2018.11.053

Use of data imputation tools to reconstruct incomplete air quality datasets: A case-study in Temuco, Chile. / Quinteros, María Elisa; Lu, Siyao; Blazquez, Carola; Cárdenas-R, Juan Pablo; Ossa, Ximena; Delgado-Saborit, Juana María; Harrison, Roy M.; Ruiz-Rudolph, Pablo.

In: Atmospheric Environment, Vol. 200, 01.03.2019, p. 40-49.

TY - JOUR

T1 - Use of data imputation tools to reconstruct incomplete air quality datasets

T2 - A case-study in Temuco, Chile

AU - Quinteros, María Elisa

AU - Lu, Siyao

AU - Blazquez, Carola

AU - Cárdenas-R, Juan Pablo

AU - Ossa, Ximena

AU - Delgado-Saborit, Juana María

AU - Harrison, Roy M.

AU - Ruiz-Rudolph, Pablo

PY - 2019/3/1

Y1 - 2019/3/1

KW - Air pollution

KW - Environmental epidemiology

KW - Missing data

KW - Multiple imputation

KW - Single imputation

KW - Wood-burning

UR - http://www.scopus.com/inward/record.url?scp=85058415433&partnerID=8YFLogxK

U2 - 10.1016/j.atmosenv.2018.11.053

DO - 10.1016/j.atmosenv.2018.11.053

M3 - Article

AN - SCOPUS:85058415433

VL - 200

SP - 40

EP - 49

JO - Atmospheric Environment

JF - Atmospheric Environment

SN - 1352-2310

ER -