Open Conference Systems, ITACOSM 2019 - Survey and Data Science

Analysis of integrated data for Official Statistics
Li-Chun Zhang

Building: Learning Center Morgagni
Room: Aula Magna 327
Date: 2019-06-06 02:30 PM – 03:30 PM
Last modified: 2019-05-06


Methods for assessing and adjusting the representativeness of data from a single source, such as responses obtained from a sample survey or treatment results collected in an observational study, have traditionally received much attention, as have methods for dealing with the impact of any additional errors potentially caused by an imperfect measurement instrument or mechanism. The rapid uptake of administrative and transaction data in official, scientific and commercial applications over the last decade, together with the impact of the “Big Data Revolution”, is leading to a growing diversification of primary data sources. Often, when data from multiple sources are combined to enable statistical inference, or to generate new statistical data for purposes that cannot be served by any single source on its own, an important additional source of error emerges: whether the data obtained from the study units in the combined data are in fact valid observations from the population of units that these data are supposed to represent. From an overarching perspective, one may refer to this as the problem of inference under entity ambiguity.

From a data integration perspective, a situation of entity ambiguity can be characterised by the lack of an identified population set of target units or an observed subpopulation set of such units. In this talk, I present in particular some recent developments in three generic settings of entity ambiguity: statistical analysis and uses of linked datasets that may contain linkage errors; datasets created by a data fusion process, where joint statistical information is simulated using the information in marginal data from non-overlapping sources; and estimation of target population size when target units are partially and erroneously covered in each source. Emphasis will be given to statistical uncertainty and inference issues due to entity ambiguity that arises from integrating information from multiple sources.
