Estimating population census tables and their accuracy using Multiple Imputation of Latent Classes (MILC) with multi-source data

Laura Boeschoten; Sander Scholtus; Ton de Waal; Jacco Daalmans; Jeroen Vermunt

Open Conference Systems, ITACOSM 2019 - Survey and Data Science

Building: Learning Center Morgagni
Room: Aula 209
Date: 2019-06-07 12:00 PM – 01:30 PM
Last modified: 2019-05-06

Abstract

Statistics Netherlands can rely on population registers as the main data source for most tables in the decennial population and housing census. As not all variables required for the census can be found in registers, the data are supplemented with variables originating from the Labor Force Survey. To resolve inconsistencies on the micro level, the available register and survey variables are combined in a micro-integration process. To prevent inconsistencies during estimation, the so-called repeated weighting method is applied.

The current approach has some theoretical drawbacks. First, to resolve inconsistencies during micro-integration, it is usually assumed that either the register variable or the survey variable is correct. In practice, however, neither source is completely free of error. Second, the uncertainty due to missing and conflicting values in the original data is not incorporated in the variance estimates of the estimated census tables. Third, the current procedure involves a specific sequence of process steps, where decisions made at each step influence later steps. This may lead to estimated tables that are sub-optimal in terms of bias and variance, given the available data.

Boeschoten et al. (2017) described a method that circumvents the above issues by combining Multiple Imputation and Latent Class analysis (MILC). The main goal of the MILC method is to correct for misclassification by using measures of the same variable originating from different sources (e.g., a register and a survey) that are linked on the unit level. The observed variables are used as indicators in a latent class model, where the latent variable represents an error-free version of the underlying variable of interest. After the model has been estimated, multiple imputed versions of the variable of interest are created, thereby correcting for the estimated misclassification. The differences between the multiple imputations reflect the uncertainty due to missing and conflicting values, which can be incorporated in an estimate of the total variance using formulas similar to those of traditional multiple imputation.
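The imputation and pooling steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the latent class parameters (class proportions and misclassification probabilities) are already known, whereas MILC estimates them from the data, and it pools with the traditional multiple-imputation formulas (Rubin's rules) without any finite-population adjustment. All names and numeric values are hypothetical.

```python
import random

def posterior(prior, error_probs, observed):
    """P(true class | observed indicators), assuming local independence.

    prior[c]             : P(X = c), the latent class proportions
    error_probs[j][c][y] : P(Y_j = y | X = c) for indicator j
    observed[j]          : observed category of indicator j, or None if missing
    """
    weights = []
    for c, p in enumerate(prior):
        w = p
        for j, y in enumerate(observed):
            if y is not None:  # a missing indicator contributes no information
                w *= error_probs[j][c][y]
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

def impute_once(prior, error_probs, data, rng):
    """Draw one imputed 'true' class per unit from its posterior distribution."""
    classes = list(range(len(prior)))
    return [
        rng.choices(classes, weights=posterior(prior, error_probs, obs))[0]
        for obs in data
    ]

def pool(estimates, variances):
    """Rubin's rules: pooled estimate and total variance over m imputations."""
    m = len(estimates)
    q_bar = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    return q_bar, within + (1 + 1 / m) * between

# Hypothetical parameters: two classes, two indicators (register, survey).
prior = [0.7, 0.3]
error_probs = [
    [[0.95, 0.05], [0.10, 0.90]],  # register: P(Y | X = 0), P(Y | X = 1)
    [[0.90, 0.10], [0.05, 0.95]],  # survey
]
# Linked records as (register value, survey value); None marks a unit
# that was not in the survey sample.
data = [(0, 0), (0, 1), (1, 1), (1, None), (0, None)] * 40

rng = random.Random(1)
estimates, variances = [], []
for _ in range(5):  # m = 5 imputations
    imputed = impute_once(prior, error_probs, data, rng)
    p = sum(1 for x in imputed if x == 1) / len(imputed)
    estimates.append(p)
    variances.append(p * (1 - p) / len(imputed))  # simple binomial variance

p_hat, total_var = pool(estimates, variances)
print(p_hat, total_var)
```

The between-imputation term is what carries the uncertainty due to missing and conflicting source values into the total variance.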

The aim of the current study was to investigate whether the MILC method could be used to estimate population census tables. Theoretically, an adjustment to the method was needed to accommodate finite-population estimation, under the assumption that the register data completely cover the target population. Practically, we wanted to test whether the method can handle several latent variables simultaneously with many edit restrictions between them.

To evaluate the performance of the MILC method, a simulation study was conducted on a real table from the Dutch 2011 census, to which artificial misclassifications and missing data were added. The results show that the method could correct for bias due to misclassification, both for marginal distributions and more detailed cross-classifications. The MILC method also produced reasonable variance estimates in general, although variances tended to be overestimated for cells with relatively small frequencies.
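A simulation of this kind can be sketched as follows. The abstract does not specify the error mechanism, so the sketch assumes misclassification and missingness are applied completely at random; the category labels, rates and helper name are hypothetical and are not taken from the study.

```python
import random

def perturb(values, categories, error_rate, missing_rate, rng):
    """Return a copy of `values` with random missingness and misclassification.

    Each value is set to None with probability `missing_rate`; otherwise it is
    replaced by a different category with probability `error_rate`.
    """
    out = []
    for v in values:
        if rng.random() < missing_rate:
            out.append(None)
        elif rng.random() < error_rate:
            out.append(rng.choice([c for c in categories if c != v]))
        else:
            out.append(v)
    return out

rng = random.Random(2011)
categories = ["employed", "unemployed", "inactive"]  # hypothetical labels
truth = [rng.choice(categories) for _ in range(1000)]

# Register: covers every unit but contains some misclassification.
register = perturb(truth, categories, error_rate=0.05, missing_rate=0.0, rng=rng)
# Survey: observed for a sample only, so most units are missing.
survey = perturb(truth, categories, error_rate=0.02, missing_rate=0.9, rng=rng)

# Bias of the naive register-based share relative to the known truth.
true_share = truth.count("employed") / len(truth)
naive_share = register.count("employed") / len(register)
print(naive_share - true_share)
```

Because the true values are known in such a setup, the bias of any estimator, whether naive or model-based, can be measured directly against them.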

Reference: L. Boeschoten, D. Oberski & T. de Waal (2017), Estimating classification errors under edit restrictions in composite survey-register data using Multiple Imputation Latent Class modelling (MILC). Journal of Official Statistics 33, 921–962.