Open Conference Systems, ITACOSM 2019 - Survey and Data Science

A Bayesian approach for multiple regression with linked and deduplicated data
Brunero Liseo, Rebecca C. Steorts, Andrea Tancredi

Building: Learning Center Morgagni
Room: Aula 209
Date: 2019-06-07 09:00 AM – 10:30 AM
Last modified: 2019-05-06


We propose a Bayesian approach for performing record linkage and regression across arbitrarily many lists, while simultaneously considering duplicate detection. We frame the linkage problem as a clustering task, where similar records are clustered to true latent individuals. We propose a statistical model that incorporates both the linking process and the inferential process, including the features of the records as well as the variables needed for inference. Central to our approach is the observation that the prior over the space of linkages can be written as a random partition model. In particular, the Pitman-Yor process is used as the prior distribution for the cluster assignment of records. By jointly modeling the record linkage and inferential processes, one can account for matching uncertainty in inferential procedures based on linked data.
Moreover, the information provided by the working statistical model can be fed back into the record linkage process. This feedback mechanism is essential for eliminating potential biases that can jeopardize the resulting post-linkage inference. We apply our methodology to the case of multiple regression, and illustrate empirically that the feedback mechanism improves the performance of the record linkage process.
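
As a rough illustration of what a Pitman-Yor prior over record partitions looks like, the sketch below draws cluster labels from the standard two-parameter Chinese restaurant process, which is one common way to sample from that prior. It is not the authors' implementation: the function name and the parameter names `discount` and `strength` are hypothetical, and the sketch covers only the prior, not the record features, the regression model, or the posterior sampler.

```python
import random


def sample_pitman_yor_partition(n_records, discount, strength, seed=None):
    """Draw cluster labels for n_records from a two-parameter
    Chinese restaurant process (Pitman-Yor prior on partitions).

    Hypothetical helper for illustration only.
    Requires 0 <= discount < 1 and strength > -discount.
    """
    if not 0 <= discount < 1 or strength <= -discount:
        raise ValueError("need 0 <= discount < 1 and strength > -discount")
    if n_records <= 0:
        return []

    rng = random.Random(seed)
    labels = [0]  # the first record always opens the first cluster
    sizes = [1]   # sizes[j] = number of records currently in cluster j

    for _ in range(1, n_records):
        k = len(sizes)
        # Existing cluster j is chosen with weight (size_j - discount);
        # a new cluster is opened with weight (strength + discount * k).
        weights = [s - discount for s in sizes] + [strength + discount * k]
        j = rng.choices(range(k + 1), weights=weights)[0]
        if j == k:
            sizes.append(1)
        else:
            sizes[j] += 1
        labels.append(j)
    return labels


if __name__ == "__main__":
    labels = sample_pitman_yor_partition(20, discount=0.5, strength=1.0, seed=1)
    # Records with the same label are treated as the same latent individual.
    print(labels)
```

Under this reading, records that share a label refer to the same latent individual, whether they come from different lists (linkage) or from the same list (duplicates). A larger `discount` tends to produce more small clusters, which is the regime usually expected in record linkage.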

Full Text: SLIDES