An empirical evaluation of latent class models for multisource statistics

Loredana Di Consiglio; Marco Di Zio; Danila Filipponi

Open Conference Systems, ITACOSM 2019 - Survey and Data Science

Building: Learning Center Morgagni
Room: Aula 209
Date: 2019-06-07 12:00 PM – 01:30 PM
Last modified: 2019-05-06

Abstract

The statistical production system of the Italian National Institute of Statistics (Istat) is moving towards massive use of administrative data, which, together with sample surveys, allow a System of Integrated Registers (SIR) to be built. The SIR will constitute the reference structure for all analyses and statistical outputs of the Institute. The system is composed of base registers, which contain the populations of interest for the different domains along with some important variables (core variables). In addition, the SIR includes satellite (thematic) registers, which provide information on specific subjects.

The core variables included in the registers may be used to obtain register-based statistics, which are computed directly from the values in the register. The values of the core variables are themselves estimated by combining administrative and survey data. Under the traditional approach, which may be described as supervised, the survey data are assumed to provide correct measures of the target variables, and the administrative data are used as an auxiliary source of information. This is usually justified by the fact that the administrative measures may not correspond to the target variables. In this framework, a mass imputation approach can be adopted to obtain data for all units of the register. An example is the mass imputation procedure for estimating the attained level of education (Di Cecco et al. 2018, Scholtus 2018).
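As a rough illustration of the supervised setting, the following Python sketch imputes a binary target variable for every register unit that is not in the survey sample. It is a minimal example under stated assumptions and not the procedure used in the cited works: the function `mass_impute`, the toy data, and the 90% agreement rate between the administrative category and the target are all hypothetical. Survey values are treated as error-free, and missing values are drawn from the distribution of the target observed within each administrative category.

```python
import random
from collections import Counter, defaultdict

def mass_impute(register, survey_values, rng):
    """Complete the target variable for every register unit.

    register:      dict unit_id -> administrative category (known for all units)
    survey_values: dict unit_id -> target value (known for sampled units only,
                   treated as error-free, as in the supervised approach)
    """
    # Estimate the distribution of the target within each administrative category.
    by_admin = defaultdict(Counter)
    overall = Counter()
    for uid, y in survey_values.items():
        by_admin[register[uid]][y] += 1
        overall[y] += 1

    completed = {}
    for uid, admin in register.items():
        if uid in survey_values:
            completed[uid] = survey_values[uid]  # keep the observed survey value
        else:
            # Use the overall distribution for categories unseen in the sample.
            dist = by_admin[admin] if admin in by_admin else overall
            values = list(dist)
            weights = [dist[v] for v in values]
            completed[uid] = rng.choices(values, weights=weights)[0]
    return completed

# Toy data (hypothetical): the administrative category agrees with the
# true target value about 90% of the time.
rng = random.Random(1)
truth = {uid: rng.randint(0, 1) for uid in range(2000)}
register = {uid: y if rng.random() < 0.9 else 1 - y for uid, y in truth.items()}
sampled = rng.sample(sorted(register), 200)
survey_values = {uid: truth[uid] for uid in sampled}

completed = mass_impute(register, survey_values, rng)

# Register-based statistic: computed directly on the completed register.
share = sum(completed.values()) / len(completed)
print(f"register-based share with target = 1: {share:.3f}")
```

Because the imputed values are random draws from an estimated model, the resulting statistic carries both sampling and imputation variability, which is why variance estimation for such statistics needs dedicated methods.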

To take into account deficiencies in both measurement processes, an unsupervised approach can be adopted. In this context, the unknown target variables are modeled as latent variables, and the administrative and statistical sources are treated as imperfect measures of these latent variables. An application can be found in Filipponi et al. (2019), where the authors propose estimating employment status from Labour Force and Social Security data. Given the longitudinal nature of the available information, they model employment status as a latent variable through a Hidden Markov Model.
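The unsupervised idea can be sketched with a simple cross-sectional latent class model fitted by the EM algorithm. This is only an illustrative simplification, not the Hidden Markov Model of Filipponi et al. (2019): it assumes a binary latent variable, conditional independence of the sources given the latent class, and three hypothetical sources, since an unrestricted two-class model needs at least three binary indicators to be identified. The function name, error rates, and simulated data are assumptions made for the example.

```python
import random

def fit_two_class_model(data, n_iter=200, eps=1e-6):
    """EM for a two-class latent class model with binary indicators.

    data: list of equal-length tuples of 0/1 values, one tuple per unit.
    Returns (pi, p0, p1, posterior) where pi = P(latent = 1),
    p0[j] = P(indicator j = 1 | latent = 0), p1[j] = P(indicator j = 1 | latent = 1),
    and posterior[i] = P(latent = 1 | indicators of unit i) from the last E-step.
    """
    k = len(data[0])
    # Starting values fix the labelling: class 1 is where indicators tend to be 1.
    pi, p0, p1 = 0.5, [0.2] * k, [0.8] * k

    def clip(p):
        # Keep probabilities away from 0 and 1 to avoid degenerate likelihoods.
        return min(max(p, eps), 1 - eps)

    posterior = []
    for _ in range(n_iter):
        # E-step: posterior probability of latent = 1 for each unit.
        posterior = []
        for row in data:
            like1, like0 = pi, 1 - pi
            for j, y in enumerate(row):
                like1 *= p1[j] if y else 1 - p1[j]
                like0 *= p0[j] if y else 1 - p0[j]
            posterior.append(like1 / (like1 + like0))
        # M-step: weighted proportions using the posterior weights.
        w1 = sum(posterior)
        w0 = len(data) - w1
        pi = clip(w1 / len(data))
        p1 = [clip(sum(w * row[j] for w, row in zip(posterior, data)) / w1)
              for j in range(k)]
        p0 = [clip(sum((1 - w) * row[j] for w, row in zip(posterior, data)) / w0)
              for j in range(k)]
    return pi, p0, p1, posterior

# Simulated data (hypothetical): true P(latent = 1) = 0.6 and three sources
# that misreport the latent value with the error rates below.
rng = random.Random(7)
error_rates = (0.10, 0.15, 0.05)
latent = [1 if rng.random() < 0.6 else 0 for _ in range(3000)]
data = [tuple(z if rng.random() >= e else 1 - z for e in error_rates) for z in latent]

pi_hat, p0_hat, p1_hat, posterior = fit_two_class_model(data)
print(f"estimated P(latent = 1): {pi_hat:.3f}")
```

The posterior probabilities returned by the fit can serve as predictions of the latent target variable for each unit, either directly as expected values or as the basis for random draws.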

In this paper, we propose two pseudo-population bootstrap methods (see, for instance, Chen et al. 2019) for estimating the variance of register-based statistics obtained from latent class models. We evaluate the variance estimation methods empirically through a simulation study run under different scenarios. The scenarios are designed to study how different factors, such as sampling rate, measurement errors, and type of prediction, may affect the proposed algorithms.
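The abstract does not detail the two proposed methods, so the following Python sketch shows only the generic pseudo-population bootstrap idea for simple random sampling without replacement, without any latent class prediction step. The function name, the toy population, and the choice of the sample mean as estimator are assumptions made for illustration. Each replicate builds a pseudo-population by repeating the sampled units, redraws a sample of the original size from it without replacement, and recomputes the estimator.

```python
import random
import statistics

def pseudo_population_bootstrap_variance(sample, pop_size, estimator,
                                         n_boot=500, rng=None):
    """Bootstrap variance of `estimator` under simple random sampling
    without replacement of len(sample) units from pop_size units."""
    rng = rng or random.Random(0)
    sample = list(sample)
    n = len(sample)
    copies, remainder = divmod(pop_size, n)
    estimates = []
    for _ in range(n_boot):
        # Pseudo-population: `copies` replicates of every sampled unit, completed
        # with `remainder` units drawn without replacement from the sample.
        pseudo_pop = sample * copies + rng.sample(sample, remainder)
        # Mimic the original design: draw n units without replacement.
        boot_sample = rng.sample(pseudo_pop, n)
        estimates.append(estimator(boot_sample))
    return statistics.variance(estimates)

def mean(units):
    return sum(units) / len(units)

# Toy population (hypothetical): binary variable with about 30% ones.
rng = random.Random(42)
population = [1 if rng.random() < 0.3 else 0 for _ in range(1000)]
sample = rng.sample(population, 100)

var_boot = pseudo_population_bootstrap_variance(sample, len(population), mean, rng=rng)

# Textbook variance estimator of the sample mean under the same design,
# including the finite population correction, for comparison.
f = len(sample) / len(population)
var_analytic = (1 - f) * statistics.variance(sample) / len(sample)
print(f"bootstrap: {var_boot:.6f}  analytic: {var_analytic:.6f}")
```

Resampling from a pseudo-population rather than directly from the sample lets the bootstrap reflect the finite population correction, which matters when the sampling rate is not negligible.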

References

Chen S., Haziza D., Léger C., Mashreghi Z. (2019). Pseudo-population bootstrap methods for imputed survey data. Biometrika.

Di Cecco D., Di Laurea D., Di Zio M., Filippini R., Massoli P., Rocchetti G. (2018). Mass imputation of the attained level of education in the Italian System of Registers. Workshop on Statistical Data Editing, Neuchâtel, Switzerland, 18-20 September 2018.

Filipponi D., Guarnera U., Varriale R. (2019). Hidden Markov Models to Estimate Italian Employment Status. NTTS 2019, Bruxelles, 11-13 March 2019.

Scholtus S. (2018). Variances of Census Tables after Mass Imputation. CBS Discussion paper, December 2018.
