Big Data for Finite Population Inference: Calibrating Pseudo-Weights

Ali Rafei; Carol Flannagan; Michael R Elliott

Open Conference Systems, ITACOSM 2019 - Survey and Data Science

Building: Learning Center Morgagni
Room: Aula 210
Date: 2019-06-05 04:20 PM – 06:00 PM
Last modified: 2019-05-23

Abstract

So-called “Big Data” sources are often non-probability samples, such as collections of web surveys, administrative records, or other sources selected through a non-random mechanism. Adjusting for potential selection bias is therefore critical when using Big Data for finite population inference. In the absence of a set of strong predictors that properly explain the selection mechanism, we have proposed using Bayesian additive regression trees, which can exploit large sets of individually weak covariates, to estimate pseudo-inclusion probabilities for the elements of the Big Data using a benchmark probability survey. Calibration to control totals based on model-assisted methods is another approach; it mitigates selection bias by forcing the weighted totals of auxiliary variables to equal the population totals. The general regression (GREG) estimator is a common model-assisted method in which a linear model is used to predict the outcome variable from auxiliary variables. In this study, we further adjust for selection bias in the naturalistic driving Big Data of the Safety Pilot Model Deployment by calibrating pseudo-weights using a GREG method. We employ the National Household Travel Survey data as a benchmark to estimate both the pseudo-weights and the population totals of the auxiliary variables. In addition, a conditional variance method is developed to incorporate the variability in the analytic variables, pseudo-weights, and estimated totals into the variance estimation. Our simulation results indicate that calibration can further reduce selection bias if the outcome model is correctly specified. We also show how accurately our proposed method estimates the variance under pseudo-weighting.
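The calibration step described above can be illustrated with a minimal sketch of linear (GREG-type) calibration, assuming the standard form in which each pseudo-weight d_i is adjusted to w_i = d_i (1 + x_i' lambda), with lambda chosen so that the weighted totals of the auxiliary variables match the control totals. This is not the authors' implementation: the function names, the toy pseudo-weights, the auxiliary values, and the control totals are all hypothetical and invented for illustration, and the control totals are treated as fixed here even though the study estimates them from a benchmark survey.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular calibration system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def greg_calibrate(pseudo_weights, aux, totals):
    """Linear calibration: w_i = d_i * (1 + x_i' lam), so that sum_i w_i x_i = totals."""
    p = len(totals)
    pairs = list(zip(pseudo_weights, aux))
    # Weighted auxiliary totals under the initial pseudo-weights.
    current = [sum(d * x[j] for d, x in pairs) for j in range(p)]
    # Weighted cross-product matrix sum_i d_i x_i x_i'.
    gram = [[sum(d * x[j] * x[k] for d, x in pairs) for k in range(p)]
            for j in range(p)]
    lam = solve_linear(gram, [t - c for t, c in zip(totals, current)])
    return [d * (1.0 + sum(l * xj for l, xj in zip(lam, x))) for d, x in pairs]


# Hypothetical toy data: four units with an intercept and an age covariate.
pseudo_weights = [10.0, 20.0, 15.0, 5.0]   # assumed initial pseudo-weights
aux = [[1.0, 25.0], [1.0, 40.0], [1.0, 60.0], [1.0, 35.0]]
totals = [60.0, 2500.0]                    # assumed population size and age total
outcome = [3.0, 5.0, 2.0, 4.0]             # assumed analytic variable

calibrated = greg_calibrate(pseudo_weights, aux, totals)
calibrated_totals = [sum(w * x[j] for w, x in zip(calibrated, aux))
                     for j in range(len(totals))]
greg_total = sum(w * y for w, y in zip(calibrated, outcome))
print(calibrated_totals, greg_total)
```

With the intercept included among the auxiliary variables, the calibrated weights reproduce both the assumed population size and the assumed age total. Linear calibration can produce negative weights when the initial weighted totals are far from the targets; bounded alternatives such as raking or truncated calibration are not shown here.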
