Data integration of probability and nonprobability samples

Jean-François Beaumont

Open Conference Systems, ITACOSM 2019 - Survey and Data Science

Building: Learning Center Morgagni
Room: Aula 209
Date: 2019-06-06 05:00 PM – 06:30 PM
Last modified: 2019-05-07

Abstract

The integration of data from a nonprobability source with data from a probability survey has recently received a great deal of attention in the literature and is currently being scrutinized by National Statistical Offices such as Statistics Canada. The main motivation for data integration is to reduce the data collection costs and response burden of a probability survey, by reducing either its content (number of questions) or its sample size, ideally without sacrificing the quantity and quality of the estimates produced by the survey program. Cost efficiencies can be achieved by relying as much as possible on data from nonprobability sources, which are typically much cheaper to acquire than data from a probability survey. However, estimates obtained from nonprobability samples alone often suffer from significant selection bias. The goal of data integration methods is to reduce this bias without unduly increasing costs by taking advantage of both data sources.

Small area estimation techniques can be used to deal with small sample sizes. The focus of this talk is instead on methods that allow for a reduction of the survey content. We review three data integration methods that can be used to achieve this goal: i) calibration weighting of a nonprobability sample to estimated benchmarks from a probability survey; ii) sample matching; and iii) propensity score weighting of a nonprobability sample. All three methods require a vector of auxiliary variables that is available in both the probability and nonprobability samples. A rich vector of auxiliary variables is key to significantly reducing the selection bias of estimators obtained from a nonprobability sample. When many auxiliary variables are available, selecting a relevant subset may become necessary. Some empirical results using real data will be presented to illustrate the effectiveness of the methods.
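As an illustration of method i), the following is a minimal sketch of linear (chi-square distance) calibration of a nonprobability sample to benchmark totals. The abstract does not specify which calibration form is used, so the linear form is an assumption made here for simplicity. The function name `linear_calibration_weights`, the toy auxiliary variables, and the benchmark totals are all hypothetical; in practice the benchmarks would be totals estimated from the probability survey.

```python
import numpy as np


def linear_calibration_weights(x, benchmarks, base_weights=None):
    """Linear calibration: w_i = d_i * (1 + x_i' lambda).

    x            : (n, p) auxiliary variables observed in the nonprobability sample
    benchmarks   : (p,) totals of the same variables, estimated from a probability survey
    base_weights : (n,) starting weights d_i (defaults to 1 for every unit)
    """
    x = np.asarray(x, dtype=float)
    benchmarks = np.asarray(benchmarks, dtype=float)
    n = x.shape[0]
    d = np.ones(n) if base_weights is None else np.asarray(base_weights, dtype=float)

    # Weighted cross-product matrix: sum_i d_i x_i x_i'
    cross = x.T @ (d[:, None] * x)
    # Gap between the benchmarks and the current weighted totals
    gap = benchmarks - x.T @ d
    # Solve for lambda so that the calibrated totals match the benchmarks exactly
    lam = np.linalg.solve(cross, gap)
    return d * (1.0 + x @ lam)


# Toy nonprobability sample (hypothetical): intercept, group indicator, age
x = np.array(
    [
        [1, 1, 25],
        [1, 1, 40],
        [1, 0, 35],
        [1, 0, 60],
        [1, 1, 50],
    ],
    dtype=float,
)
# Hypothetical benchmarks: population size, group count, total age
benchmarks = np.array([1000.0, 480.0, 43000.0])

weights = linear_calibration_weights(x, benchmarks)

# Variable of interest, observed only in the nonprobability sample (made-up values)
y = np.array([3.0, 5.0, 4.0, 8.0, 6.0])
estimated_total = weights @ y
```

By construction, the calibrated weights reproduce the benchmark totals for the auxiliary variables. Linear calibration can produce negative weights; raking or bounded distance functions are common alternatives when that is a concern.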
