Graphical Structural Learning for Complex Survey Data

Daniela Marella; Paola Vicard

Open Conference Systems, ITACOSM 2019 - Survey and Data Science

Building: Learning Center Morgagni
Room: Aula 209
Date: 2019-06-06 12:00 PM – 01:30 PM
Last modified: 2019-05-06

Abstract

Bayesian networks (BNs) are multivariate statistical models satisfying the set of conditional independence statements encoded in a directed acyclic graph (DAG). A network consists of two components. The first is a DAG in which each node corresponds to a random variable and the edges represent direct dependencies. The second is the set of all parameters in the network.
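
As a minimal illustration of the two components, the sketch below stores a DAG as a map from each node to its parents and the parameters as conditional probability tables, then evaluates a joint probability as the product of each node's probability given its parents. The three binary variables and all the numbers are hypothetical and chosen only for illustration.

```python
# Component 1: the DAG, stored as node -> tuple of parent nodes.
# Hypothetical network: Rain -> WetGrass <- Sprinkler.
parents = {"Rain": (), "Sprinkler": (), "WetGrass": ("Rain", "Sprinkler")}

# Component 2: the parameters, P(node = True | parent values),
# keyed by the tuple of parent values. All numbers are made up.
cpt = {
    "Rain": {(): 0.2},
    "Sprinkler": {(): 0.1},
    "WetGrass": {
        (True, True): 0.99,
        (True, False): 0.9,
        (False, True): 0.8,
        (False, False): 0.0,
    },
}


def joint_probability(assignment):
    """Multiply the conditional probability of each node given its parents."""
    prob = 1.0
    for node, pa in parents.items():
        p_true = cpt[node][tuple(assignment[p] for p in pa)]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob
```

For example, `joint_probability({"Rain": True, "Sprinkler": False, "WetGrass": True})` multiplies 0.2, 0.9 and 0.9.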

In recent years BNs have been successfully applied in a wide variety of contexts, official statistics among them. They have proved very useful for missing item imputation, for contingency table estimation under complex survey sampling, and for handling measurement errors. However, some obstacles still hinder wider application in official statistics. In particular, missing item imputation and measurement error correction can be performed only once the BN structure is available (either known in advance or learned from data), since information has to be propagated throughout the network. It is therefore necessary to develop structural learning algorithms that account for the complexity of the sampling design.

Learning BNs from a sample can be time-consuming and challenging even when the data are independent and identically distributed (i.i.d.). The PC algorithm is one of the best-known procedures for structural learning of Bayesian networks. It has several advantages, among them an intuitive basis.

The original PC algorithm consists of two phases: in the first the skeleton is estimated, and in the second the orientation of the arrows is identified.
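
The skeleton phase can be sketched as follows. This is a simplified rendering of the standard PC skeleton search, not the authors' implementation: it starts from the complete undirected graph and removes an edge as soon as some conditioning set of the current size makes its endpoints independent. The function name `pc_skeleton` and the callable `indep` are assumptions introduced here; `indep(x, y, s)` stands for whatever conditional independence test is plugged in. The orientation phase is omitted.

```python
from itertools import combinations


def pc_skeleton(nodes, indep):
    """Return (adjacency, sepsets) for the estimated skeleton.

    indep(x, y, s) must return True when x and y are judged
    conditionally independent given the set of nodes s.
    """
    # Start from the complete undirected graph.
    adj = {v: set(nodes) - {v} for v in nodes}
    sepset = {}
    level = 0  # size of the conditioning sets tried in this pass
    # Continue while some node has enough other neighbours to condition on.
    while any(len(adj[x]) - 1 >= level for x in nodes):
        for x in nodes:
            for y in sorted(adj[x]):
                # Conditioning sets are drawn from x's current neighbours, y excluded.
                for s in combinations(sorted(adj[x] - {y}), level):
                    if indep(x, y, set(s)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(s)
                        break
        level += 1
    return adj, sepset
```

With an oracle for the chain A → B → C, where A and C are independent given B, the search removes only the edge between A and C and records {B} as its separating set.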

In the first phase, the PC algorithm relies on conditional independence tests, usually performed with the standard Pearson chi-squared test statistic under the i.i.d. assumption, which is equivalent to assuming simple random sampling.
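
For reference, the standard Pearson statistic for a two-way table of counts can be computed as in the sketch below. The function name `pearson_chi2` is introduced here for illustration, and the code assumes every expected count is positive.

```python
def pearson_chi2(table):
    """Pearson X^2 and degrees of freedom for a two-way table of counts.

    `table` is a list of rows; all expected counts are assumed positive.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, dof
```

Under the i.i.d. assumption the statistic is compared with a chi-squared distribution with `dof` degrees of freedom; for a 2×2 table the 5% critical value is about 3.84.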

However, sample selection in surveys involves more complex sampling designs based on stratification, different levels of clustering, and inclusion probabilities proportional to an appropriate size measure; in such circumstances the standard test procedure is not valid, even asymptotically.

To avoid misleading results about the true causal structure caused by the changes in independence and conditional independence relationships that the sampling design induces, a modified version of the PC algorithm is proposed. In this PC algorithm for complex survey data, the skeleton learning phase is modified by introducing a procedure for testing association in a two-way table for data coming from complex sample surveys. The limiting sampling distribution of the test statistic under the null hypothesis of independence is estimated by resorting to resampling methods for finite populations.
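
The abstract does not spell out the test statistic or the resampling scheme, so the sketch below is only a generic illustration and not the proposed procedure. It assumes each sampled unit carries a survey weight (for instance the inverse of its inclusion probability), estimates the cell proportions of the two-way table from those weights, and computes a Pearson-type statistic from the estimated proportions. The names `weighted_table` and `weighted_pearson` are hypothetical, and every category is assumed to have positive estimated marginal proportion.

```python
def weighted_table(x, y, weights):
    """Estimate two-way cell proportions from survey-weighted observations."""
    rows = sorted(set(x))
    cols = sorted(set(y))
    total = sum(weights)
    props = [[0.0] * len(cols) for _ in rows]
    for xi, yi, w in zip(x, y, weights):
        # Each unit contributes its share of the total weight to its cell.
        props[rows.index(xi)][cols.index(yi)] += w / total
    return props


def weighted_pearson(props, n):
    """Pearson-type statistic from estimated proportions and sample size n."""
    row = [sum(r) for r in props]
    col = [sum(c) for c in zip(*props)]
    return n * sum(
        (props[i][j] - row[i] * col[j]) ** 2 / (row[i] * col[j])
        for i in range(len(row))
        for j in range(len(col))
    )
```

With equal weights this reduces to the ordinary Pearson statistic. With unequal weights from a complex design, its reference distribution is in general no longer the standard chi-squared one, which is why the null distribution has to be estimated, here by resampling.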