Reconstructing missing data sequences in multivariate time series: an application to environmental data
Last modified: 2018-05-18
Abstract
Missing data arise in many statistical analyses and can have a significant effect on the conclusions that can be drawn from the data. In environmental data, a common approach adopted by Environmental Protection Agencies is to delete the observations with incomplete information from the study, which leads to a substantial underestimation of many of the indices used to evaluate air quality. In multivariate time series, not only isolated values but also long sequences of some of the component series may be missing. We propose a new procedure that aims to reconstruct the missing sequences by exploiting both the spatial correlation and the serial correlation of the multivariate time series. The proposed procedure is based on a spatial dynamic model and imputes the missing values of a series using a linear combination of the contemporaneous observations of its neighbors and their lagged values. It is oriented to spatio-temporal data, although it is general enough to be applied to generic stationary multivariate time series. The procedure has been applied to pollution data with remarkably satisfactory performance.
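To make the idea of imputing from "neighbor contemporary observations and their lagged values" concrete, the following is a minimal sketch and not the authors' procedure. It assumes the neighbors of the target series are given and fully observed, uses a single lag, and fits the coefficients of the linear combination by ordinary least squares as a stand-in for whatever estimator the spatial dynamic model actually uses. The function name `impute_from_neighbors` and the synthetic data are hypothetical.

```python
import numpy as np


def impute_from_neighbors(y, target, neighbors, lag=1):
    """Fill NaNs in column `target` of y (shape T x p).

    Assumption (not from the abstract): the fill-in value is an OLS-fitted
    linear combination of the neighbor columns at time t and at time t - lag.
    """
    y = np.asarray(y, dtype=float)
    n_times = y.shape[0]

    nb = y[:, neighbors]                 # contemporaneous neighbor values
    lagged = np.full_like(nb, np.nan)    # neighbor values shifted by `lag`
    lagged[lag:] = nb[:-lag]

    # Design matrix: intercept, neighbors at t, neighbors at t - lag.
    X = np.column_stack([np.ones(n_times), nb, lagged])

    usable = ~np.isnan(X).any(axis=1)        # rows with all regressors present
    observed = ~np.isnan(y[:, target])       # rows where the target is known

    # Estimate the weights on rows where both sides are available.
    fit_rows = usable & observed
    coef, *_ = np.linalg.lstsq(X[fit_rows], y[fit_rows, target], rcond=None)

    # Predict the target only where it is missing and regressors exist.
    filled = y.copy()
    fill_rows = usable & ~observed
    filled[fill_rows, target] = X[fill_rows] @ coef
    return filled


# Synthetic illustration: the target depends on neighbor 1 at time t
# and on neighbor 2 at time t - 1 (noise-free, purely for demonstration).
rng = np.random.default_rng(0)
T = 200
neighbor_data = rng.normal(size=(T, 2))
target_series = 1.0 + 2.0 * neighbor_data[:, 0]
target_series[1:] += 0.5 * neighbor_data[:-1, 1]

y = np.column_stack([target_series, neighbor_data])
truth = y[:, 0].copy()
y[80:120, 0] = np.nan                    # a long missing sequence

filled = impute_from_neighbors(y, target=0, neighbors=[1, 2])
```

Because the regressors come from other series rather than from the target's own past, a long gap in the target can be filled in one pass, which is the situation the abstract highlights. With real data the neighbors may themselves have gaps, and the weights would need an estimator suited to the spatial dynamic model.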