Handbook on Data Validation in Eurostat
1. INTRODUCTION

A main goal of any statistical organization is the dissemination of high-quality information, and this is particularly true of Eurostat. Quality implies that the data available to users have the ability to satisfy their needs and requirements concerning statistical information, and it is defined in a multidimensional way involving six criteria: Relevance, Accuracy, Timeliness and punctuality, Accessibility and clarity, Comparability and Coherence.

Broadly speaking, data validation may be defined as supporting all the other steps of the data production process in order to improve the quality of statistical information. In the Handbook on improving quality by analysis of process variables (LEG on Quality project by ONS UK, Statistics Sweden, the National Statistical Service of Greece and INE PT) it is described as the method of detecting errors resulting from data collection. In short, data validation is designed to check the plausibility of the data and to correct possible errors. It is one of the most complex operations in the life cycle of statistical data, comprising steps and procedures of two main categories: checks (or edits) and transformations (or imputations). Its three main components are the following:

Data editing. The application of checks that identify missing, invalid or inconsistent entries, or that point to data records that are potentially in error.

Missing data and imputation. Analysis of imputation and reweighting methods used to correct for missing data caused by non-response. Non-response can be total, when there is no information on a given respondent (unit non-response), or partial, when only part of the information on the respondent is missing (item non-response). Imputation is a procedure used to estimate and replace missing or inconsistent (unusable) data items in order to provide a complete data set.
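The item non-response case can be illustrated with a minimal sketch: missing values in one numeric variable are replaced by the mean of the observed values. The variable name and data below are illustrative assumptions, not taken from any Eurostat survey; real imputation methods (hot-deck, regression, reweighting) are considerably more elaborate.

```python
# Minimal sketch of item non-response imputation: missing values (None)
# in a numeric variable are replaced by the mean of the observed values.
# Variable name and figures are illustrative, not from any Eurostat survey.

def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    mean = sum(observed) / len(observed)
    return [v if v is not None else mean for v in values]

turnover = [120.0, None, 80.0, 100.0, None]  # item non-response on two records
print(mean_impute(turnover))  # -> [120.0, 100.0, 80.0, 100.0, 100.0]
```

Mean imputation preserves the variable's mean but shrinks its variance, which is why production systems usually prefer donor-based or model-based methods.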
Advanced validation. Advanced statistical methods can be used to improve data quality. Many of them are related to outlier detection, since the conclusions and inferences drawn from a data set contaminated by outliers may be seriously biased.
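One common outlier-detection device is the modified z-score built from the median and the median absolute deviation (MAD); unlike the mean and standard deviation, these estimators are not themselves distorted by the outliers they are meant to flag. The sketch below, with an illustrative data set and the conventional 3.5 threshold, is one possible instance of such a method, not a prescription of Eurostat practice.

```python
# Minimal sketch of outlier detection with a robust (modified) z-score
# based on the median and the median absolute deviation (MAD).
# Data and threshold are illustrative.

def _median(sorted_values):
    n = len(sorted_values)
    mid = n // 2
    return sorted_values[mid] if n % 2 else (sorted_values[mid - 1] + sorted_values[mid]) / 2

def robust_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    median = _median(sorted(values))
    mad = _median(sorted(abs(v - median) for v in values))
    if mad == 0:
        return []
    # 0.6745 rescales the MAD to be consistent with the standard
    # deviation under normality.
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - median) / mad) > threshold]

print(robust_outliers([10, 11, 9, 10, 12, 95]))  # -> [5]
```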
Before Eurostat dissemination, data validation has to be performed at different stages, depending on who is processing the data:

The first stage takes place at the end of the collection phase and concerns micro-data. Member States are responsible for it, since they conduct the surveys.

The second stage concerns country data, i.e. the micro-data country aggregates sent by Member States to Eurostat. At this stage, validation has to be performed by Eurostat.

The third and last stage concerns aggregate (Eurostat) data before their dissemination, and it is also performed by Eurostat.
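At the first, micro-data stage, validation typically means running each record through a battery of edit rules. The sketch below shows two elementary kinds of edit, a range check and a consistency check between two fields; the field names and bounds are hypothetical, chosen only to illustrate the mechanism.

```python
# Minimal sketch of stage-one micro-data editing: each record is run
# through simple edit rules (a range check and a consistency check).
# Field names and bounds are illustrative assumptions, not Eurostat rules.

def validate_record(record):
    """Return a list of failed-edit messages for one micro-data record."""
    errors = []
    if not (0 <= record.get("age", -1) <= 120):
        errors.append("age out of range [0, 120]")
    if record.get("employees", 0) > record.get("persons_employed", 0):
        errors.append("employees exceeds total persons employed")
    return errors

records = [
    {"age": 34, "employees": 5, "persons_employed": 6},
    {"age": 150, "employees": 9, "persons_employed": 4},
]
for i, rec in enumerate(records):
    print(i, validate_record(rec))
```

Records with a non-empty error list would be flagged for correction or imputation before the country aggregates are compiled and sent on to Eurostat.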
Validation should be performed according to a set of common rules and a set of specific rules depending on the stage and on the data aggregation level. In this document, some general and common guidance is provided for each stage. More detailed rules and procedures can only be provided when looking at a specific survey, since each one has its own particular characteristics and problems. A thorough set of validation guidelines can therefore only be defined for a specific statistical project. Nevertheless, this document intends to discuss the most important issues that arise concerning the validation of any statistical data set, describing its main problems and how to handle them. It lists as thoroughly as possible the different aspects that need to be analyzed for error diagnosis and checking, the most adequate methods and procedures for that purpose and