Quality issues when using Big Data in Official Statistics
Paolo Righi, Giulio Barcaroli, Natalia Golini

Last modified: 2017-05-24


The opportunity to produce enhanced statistics under declining budgets makes the use of Big Data (BD) appealing to National Statistical Offices (NSOs). The debate on these sources often focuses on volume, velocity, and variety, and on the IT capability to capture, store, process, and analyze BD for statistical production. Nevertheless, other Vs have to be taken into account, especially in NSOs: veracity (data quality, i.e. the selectivity and trustworthiness of the information) and validity (data that are correct and accurate for the intended use). Veracity and validity affect the accuracy (bias and variance) of the estimates and therefore raise the question of whether a large amount of data necessarily produces high-quality statistics. This paper evaluates the conditions under which a BD-based approach, suffering from selectivity and validity problems, is competitive with a survey sampling estimator. A simulation study based on real data has been carried out: starting from the Italian business register, a synthetic population of enterprises with web sites has been set up, and the target variables and the variables scraped from the web sites have been built according to the distributions observed in Istat ICT survey data. Design-based estimators and supervised model-based estimators using the scraped data are compared in terms of bias and MSE.
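The comparison described in the abstract can be illustrated with a minimal Monte Carlo sketch. All of the specifics below (population size, prevalence, misclassification rate, website-ownership probabilities) are invented assumptions for illustration, not the paper's actual simulation design: a binary target is estimated either by the mean of repeated probability samples (design-based) or by the mean of an error-prone scraped proxy over the selective subpopulation of enterprises with a website.

```python
import random
import statistics

random.seed(42)

# Hypothetical synthetic population (assumed setup, not the paper's data):
# each enterprise has a binary target y (e.g. performs e-commerce), a scraped
# proxy x that equals y only with probability 0.85 (a validity error), and a
# website-ownership indicator that depends on y (a selectivity error).
N = 10_000
population = []
for _ in range(N):
    y = 1 if random.random() < 0.3 else 0              # true target, 30% prevalence
    x = y if random.random() < 0.85 else 1 - y         # proxy with misclassification
    has_site = random.random() < (0.8 if y else 0.6)   # ownership correlated with y
    population.append((y, x, has_site))

true_mean = sum(y for y, _, _ in population) / N

# Design-based benchmark: repeated simple random samples of size n;
# the sample mean is design-unbiased for the population mean.
n, R = 200, 500
srs_estimates = [
    sum(y for y, _, _ in random.sample(population, n)) / n for _ in range(R)
]
srs_bias = statistics.mean(srs_estimates) - true_mean
srs_mse = statistics.mean((e - true_mean) ** 2 for e in srs_estimates)

# "Big Data" estimator: mean of the scraped proxy over enterprises with a
# website. It uses the whole observable population (no sampling variance)
# but inherits both the selectivity and the validity error.
web_proxies = [x for _, x, s in population if s]
bd_estimate = sum(web_proxies) / len(web_proxies)
bd_bias = bd_estimate - true_mean

print(f"true mean          : {true_mean:.3f}")
print(f"SRS bias / MSE     : {srs_bias:+.4f} / {srs_mse:.5f}")
print(f"BD estimate / bias : {bd_estimate:.3f} / {bd_bias:+.4f}")
```

In this toy setup the survey estimator is nearly unbiased with a small sampling variance, while the BD estimator has no sampling variance but a systematic bias from selectivity and misclassification, so its MSE is dominated by the squared-bias term; which approach wins depends on how large that bias is relative to the survey's variance.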