Deduplication and population size estimation

Andrea Tancredi; Brunero Liseo

Open Conference Systems, ITACOSM 2019 - Survey and Data Science

Building: Learning Center Morgagni
Room: Aula 210
Date: 2019-06-07 12:00 PM – 01:30 PM
Last modified: 2019-05-06

Abstract

Data de-duplication is the process of detecting records in one or more datasets which refer to the same entity. In this paper we illustrate the relationships between de-duplication problems and capture-recapture models. In particular, we tackle the de-duplication process via a latent entity model, where the observed data comprise clusters of perturbed versions of a set of key variables drawn from a finite population of N different entities. We treat the population size N as an unknown model parameter. As a result, we are able to account for the de-duplication uncertainty in the population size estimation. We apply our method to two synthetic data sets comprising German names. In addition, we illustrate a real data application, where we match records from two lists which report information about people killed in the recent Syrian conflict.
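
To make the link between de-duplication and capture-recapture concrete, the sketch below uses the classical two-list Lincoln-Petersen estimator and Chapman's bias-corrected variant. This is not the latent entity model described in the abstract; it is only a simple illustration of why the number of matched records matters for estimating N. The function names and all counts (n1, n2, m) are hypothetical and chosen for illustration.

```python
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Two-list estimate of the population size N.

    n1, n2: number of distinct entities on list 1 and list 2
    m:      number of entities found on both lists (the matches)
    """
    if m <= 0:
        raise ValueError("at least one match is needed to estimate N")
    return n1 * n2 / m


def chapman(n1: int, n2: int, m: int) -> float:
    """Chapman's bias-corrected variant of the same estimator."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


if __name__ == "__main__":
    # Hypothetical list sizes, not taken from the paper.
    n1, n2 = 200, 150
    # Different de-duplication outcomes give different match counts m,
    # and therefore different estimates of N.
    for m in (50, 60, 75):
        print(m, round(lincoln_petersen(n1, n2, m)), round(chapman(n1, n2, m)))
```

The estimate is sensitive to m: a linkage procedure that misses true matches inflates the estimate of N, while false matches deflate it. Treating the matching as fixed ignores this source of error, which is the motivation for modelling the de-duplication and N jointly.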
