Building: Learning Center Morgagni

Room: Aula 209

Date: 2019-06-07 09:00 AM – 10:30 AM

Last modified: 2019-05-06

#### Abstract

Record linkage is a crucial step in the new approach to official statistics, which is increasingly based on the integration of different data sources. The Italian National Statistical Institute (Istat) is quickly moving toward an integrated statistical system based on registers fed by administrative and survey data. The main advantages include recognising relationships between different units (e.g. people and households), following changes over time, and the availability of many variables regarding the same unit. The whole system depends on three main aspects: large amounts of administrative data, unique identifiers supporting data integration, and adequate computational and storage capabilities.

Istat has been working on record linkage since the mid-1990s. A combination of probabilistic and deterministic record linkage techniques is currently used for Census coverage assessment, the construction of statistical registers, and the combined use of statistical and administrative data, e.g. for surveys on road traffic accidents, the labour force and household income. In Istat, probabilistic record linkage is mainly used to process subsets of data for which deterministic linkage failed due to errors in the identification codes. Even within this specific task, the need to process large datasets in official statistics drives Istat's research to improve the efficiency of probabilistic methods.

Here we report two real cases where our research activity attempts to answer problems arising from production. The first case study stems from the failure of probabilistic record linkage to estimate unbiased linkage probabilities, because the ratio between matches and non-matches decreases quadratically to zero as the size of the datasets to be linked grows. Filtering techniques help in this case, since they cut the set of pairs to be processed by eliminating from the analysis the pairs that are unlikely to be true matches.
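As a rough illustration of such filtering (a minimal sketch, not Istat's production method), a blocking filter can restrict comparisons to pairs that agree on a chosen key; the `zip` field below is a hypothetical blocking variable:

```python
from collections import defaultdict

def blocked_pairs(left, right, key):
    """Yield only record pairs whose blocking key agrees,
    instead of the full |left| x |right| cross product."""
    buckets = defaultdict(list)
    for rec in right:
        buckets[key(rec)].append(rec)
    for rec in left:
        for cand in buckets.get(key(rec), []):
            yield rec, cand

# Toy data: 'zip' is a hypothetical blocking variable.
left = [{"id": 1, "zip": "00100"}, {"id": 2, "zip": "00200"}]
right = [{"id": "a", "zip": "00100"},
         {"id": "b", "zip": "00100"},
         {"id": "c", "zip": "00300"}]

pairs = list(blocked_pairs(left, right, key=lambda r: r["zip"]))
# The full cross product has 6 pairs; blocking keeps only 2.
```

Here the pair space shrinks from 6 candidate pairs to 2, which is exactly the efficiency gain (and the risk, if a true match disagrees on the blocking key) discussed above.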
However, they can also mistakenly exclude an unknown number of true matches from the analysis, without providing any estimate of this risk. To avoid filtering, we propose unbiased estimates based on mixtures of multinomial distributions with structural zeros. The use of structural zeros improves the model's robustness by removing unlikely matches while remaining in a fully probabilistic setting.

The second example concerns spatial analysis, where statistical units need to be geo-referenced by linking them with address archives. In this case the units to be matched are addresses, and the key variables are strings, usually fewer than the three that represent the minimum number of variables needed to identify the probability model for linkage. Our approach models quantitative distances between strings through mixtures of continuous and categorical distributions rather than the usual latent class models. Encouraging preliminary results will be shown and discussed.
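One concrete example of a quantitative string distance for the second case study is a normalised edit distance between address strings. The sketch below (our illustration with made-up addresses, not the speakers' model) computes it in plain Python; such scores could then be the continuous input to a mixture model:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def norm_distance(a, b):
    """Edit distance scaled to [0, 1]; 0 means identical strings."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

# Hypothetical address variants: true matches tend to score near 0.
d_match = norm_distance("VIA G. MORGAGNI 40", "VIA MORGAGNI 40")
d_nonmatch = norm_distance("VIA G. MORGAGNI 40", "PIAZZA DANTE 3")
```

On such toy data the matching pair scores much closer to 0 than the non-matching pair, which is the separation a mixture of distributions over distances would try to capture.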