Producing contingency table estimates integrating survey data and Big Data

Paolo Righi; Giulio Barcaroli; Mariagrazia Rinaldi; Gianpiero Bianchi

Open Conference Systems, ITACOSM 2019 - Survey and Data Science


Building: Learning Center Morgagni
Room: Aula 209
Date: 2019-06-06 09:00 AM – 10:30 AM
Last modified: 2019-05-06

Abstract

The advent of Big Data has been producing a paradigm shift in how inference is made in official statistics. The joint application of Data Science techniques such as Text Processing and Machine Learning (ML) is suited to treating data characterized by high volume and variety, such as Big Data. With these techniques, inference becomes model based (supervised or unsupervised). National Statistical Institutes (NSIs), on the other hand, commonly apply the classical design-based approach to data collected through a controlled process, namely sample survey data. No problem arises when the model-based and design-based approaches are used in independent statistical processes, but on some occasions they may need to be used jointly.

In this paper we consider the statistics currently produced by the Istat survey on the use of Information and Communication Technology (ICT) by Italian enterprises. The survey collects several variables through a questionnaire. Since 2017, Istat has also produced experimental statistics on a limited number of variables using text processing and ML techniques applied to information scraped from enterprise websites. Quality evaluation studies have shown that the statistics based on Internet data are more accurate than the estimates based on survey data, but this innovative approach can be used to estimate the parameters of only three target variables: the presence of an online web-ordering facility on the website, the use of social media by the enterprise, and the presence of job advertisements on the website. The traditional survey-based process produces the estimates of the remaining ICT target parameters.

In this context, a concern arises when estimating contingency tables in which at least one of the three variables predicted from Internet data appears along with other questionnaire variables. In particular, the ICT survey has to produce a couple of tables for different domains of interest, crossing the online web-ordering facility variable with other surveyed variables. In this case, using the survey weights to estimate the two-way distributions raises a coherence issue: the marginal distributions of the contingency tables will generally differ from the stand-alone distribution of the variable predicted from Internet data. We propose to introduce the model-based estimates of the frequencies of the presence of online web-ordering facilities as calibration constraints. Calibration is already employed in the current estimation process, and two drawbacks could affect the feasibility of the approach: the new constraints might hamper the convergence of the procedure, and the accuracy of the estimates could worsen and needs to be carefully evaluated.
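The idea of using a model-based frequency as a calibration constraint can be illustrated with a minimal sketch. It is not the procedure used by Istat, whose calibration method and constraints are not specified here. It assumes linear (chi-square distance) calibration, which has a closed-form solution, and it runs on made-up data. The function name `linear_calibration`, the variables `web_ordering` and `other_var`, and the target total of 3000 are all hypothetical.

```python
import numpy as np


def linear_calibration(d, X, totals):
    """Adjust design weights d so that the weighted totals of the
    columns of X equal `totals` (linear / chi-square distance calibration)."""
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    totals = np.asarray(totals, dtype=float)
    # Gap between the target totals and the design-weighted totals.
    gap = totals - X.T @ d
    # Solve (sum_i d_i x_i x_i') lambda = gap for the multipliers.
    M = X.T @ (d[:, None] * X)
    lam = np.linalg.solve(M, gap)
    # Calibrated weights: w_i = d_i * (1 + x_i' lambda).
    return d * (1.0 + X @ lam)


# Made-up sample of 200 enterprises from a population of 10,000 (assumption).
rng = np.random.default_rng(0)
n = 200
d = np.full(n, 50.0)                         # design weights
web_ordering = rng.integers(0, 2, size=n)    # variable also predicted from Internet data
other_var = rng.integers(0, 2, size=n)       # a questionnaire-only variable

# Constraints: population size, plus a hypothetical model-based count
# of enterprises with a web-ordering facility.
X = np.column_stack([np.ones(n), web_ordering])
totals = np.array([10000.0, 3000.0])

w = linear_calibration(d, X, totals)

# Two-way table estimated with the calibrated weights
# (rows: web_ordering = 0, 1; columns: other_var = 0, 1).
table = np.array([
    [w[(web_ordering == a) & (other_var == b)].sum() for b in (0, 1)]
    for a in (0, 1)
])
```

With these weights, the row margin of `table` for `web_ordering == 1` reproduces the model-based total by construction, so the two-way table is coherent with the stand-alone estimate. An iterative calibration method with bounded weights, unlike this closed-form variant, may fail to converge when constraints are added, which is the first drawback mentioned above.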

The integration of survey data and Big Data is a concrete example of the statistical problems that NSIs have to deal with in order to include Big Data in their production processes, and the paper offers a first contribution in terms of possible solutions.
