Improving Data Validation using Machine Learning

Christian Ruiz

Open Conference Systems, ITACOSM 2019 - Survey and Data Science

Building: Learning Center Morgagni
Room: Aula 209
Date: 2019-06-06 09:00 AM – 10:30 AM
Last modified: 2019-05-06

Abstract

1. Introduction

The aim of this project is to extend and speed up data validation at the Swiss Federal Statistical Office (FSO) by means of machine learning algorithms and to improve data quality.

Statistical offices carry out data validation (DV) to check the quality and reliability of administrative and survey data. Data that are likely to be incorrect are sent back to the data suppliers with a correction request. Until now, DV has mainly been carried out in two ways: through manual checks, or through automated processes using threshold values and logical tests. This two-way process of “plausibility checks” involves a great deal of work. In some cases, staff must manually check the data again; in other cases, rules are applied that often require additional checks. The rule-based approach has grown out of past experience, but it is not necessarily exhaustive or always precise. Machine learning has the potential to make these checks faster and more accurate.

This project is one of the five (pilot) projects currently being developed in line with the FSO’s data innovation strategy, with the goal of augmenting and/or complementing the existing basic official statistical production at the FSO.

2. Methods

2.1. Using innovative new ways of machine learning to find alleged mistakes

This approach first trains a machine learning algorithm on historical data. Based on previous analysis, a target variable is defined that the algorithm should be able to predict. Only then can the algorithm be used for prediction. In the final stage, the predicted and actual values of the target variable are compared and the predictive accuracy is evaluated.
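
The train, predict and compare steps described above could be sketched as follows. This is a minimal illustration only: the abstract does not name the algorithm or the variables used, so a one-variable least-squares line stands in for the machine learning model, and the data, the names `fit_line` and `flag_suspicious`, and the threshold `k` are all hypothetical.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Fit y = a + b*x by ordinary least squares (stand-in for the ML model)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def flag_suspicious(model, spread, records, k=3.0):
    """Return records whose reported value is far from the predicted value."""
    a, b = model
    flagged = []
    for rec_id, x, actual in records:
        predicted = a + b * x
        # Compare predicted and actual values of the target variable.
        if abs(actual - predicted) > k * spread:
            flagged.append((rec_id, predicted, actual))
    return flagged

# Hypothetical historical data: number of employees -> reported payroll.
hist_x = [1, 2, 3, 4, 5, 6, 7, 8]
hist_y = [52, 98, 151, 203, 248, 301, 352, 397]

# Step 1: train on historical data.
model = fit_line(hist_x, hist_y)

# Step 2: measure how well the model reproduces the historical target.
residuals = [y - (model[0] + model[1] * x) for x, y in zip(hist_x, hist_y)]
spread = pstdev(residuals)

# Step 3: predict for newly delivered records and compare with what was reported.
new_records = [("A", 3, 149), ("B", 5, 2500), ("C", 7, 351)]
flagged = flag_suspicious(model, spread, new_records)

for rec_id, predicted, actual in flagged:
    print(f"record {rec_id}: reported {actual}, expected about {predicted:.0f}")
```

With these made-up numbers only record B is flagged, because its reported value lies far outside the range the historical relationship would suggest. A classifier that outputs a probability of a record being wrong would follow the same three steps.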

2.2. Using innovative methods to explain the alleged mistakes

In the second part of the project, a feedback mechanism is used to send an automatic explanation to data suppliers. This is necessary because it is impossible to combine high predictive performance and interpretability in the same algorithm. Thus, while we achieved strong predictive performance in the first part, the same algorithm cannot serve for explanation. In the second part we therefore build a feedback mechanism, a ‘local explanation’, to open the “black box”.
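
One simple way to produce a local explanation for a single flagged record is sketched below. It is an assumption for illustration, not the method used in the project: each field is replaced in turn by a typical historical value, and the change in the model's score is taken as that field's contribution. The scoring function `black_box_score`, the field names and the data are all hypothetical.

```python
from statistics import median

def black_box_score(record):
    """Hypothetical stand-in for the opaque model: suspicion score in [0, 1]."""
    ratio = record["payroll"] / max(record["employees"], 1)
    score = min(abs(ratio - 50.0) / 50.0, 1.0)
    if record["hours_per_week"] > 60:
        score = min(score + 0.3, 1.0)
    return score

def local_explanation(score_fn, record, baseline):
    """Rank fields by how much the score drops when each is set to its baseline."""
    full = score_fn(record)
    contributions = {}
    for name in record:
        perturbed = dict(record)
        perturbed[name] = baseline[name]  # replace one field with a typical value
        contributions[name] = full - score_fn(perturbed)
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical historical records used to derive typical values per field.
history = [
    {"employees": 4, "payroll": 200, "hours_per_week": 40},
    {"employees": 6, "payroll": 310, "hours_per_week": 42},
    {"employees": 8, "payroll": 390, "hours_per_week": 38},
]
baseline = {name: median(r[name] for r in history) for name in history[0]}

# A record that the prediction step has flagged as likely wrong.
flagged_record = {"employees": 5, "payroll": 2500, "hours_per_week": 41}
explanation = local_explanation(black_box_score, flagged_record, baseline)

# Turn the top-ranked field into an automatic message for the data supplier.
top_field, top_delta = explanation[0]
message = f"Please check the value reported for '{top_field}'."
print(message)
```

The explanation is local in the sense that it applies only to this one record; a different flagged record may be driven by a different field. Changing one field at a time ignores interactions between fields, which is one reason why more elaborate local explanation techniques exist.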

3. Conclusions

This will be the first time that we publicly present the initial results of the ongoing (pilot) project. The results of the first part of the project are convincing: cases with a high predicted posterior probability of being wrong do indeed appear to be wrong when checked manually. Furthermore, a comparison of the new and the previous DV methods indicates an improvement in data quality. The second part, concerning the feedback mechanism, is still more experimental; we found several possible alternatives for generating ‘local explanations’ to provide automated feedback.
