Assessing Quality for Integration Based Data M. Denk, W. Grossmann Institute for Scientific Computing

Page 1: Assessing Quality for  Integration Based Data

Assessing Quality for Integration Based Data

M. Denk, W. Grossmann

Institute for Scientific Computing

Page 2: Assessing Quality for  Integration Based Data

Contents

• Introduction
• Data Generating Processes
• Data Quality for Integration Based Production
• Assessing Quality for Integration Based Data
• Conclusions

Page 3: Assessing Quality for  Integration Based Data

Introduction – Aspects of Quality

• Quality is discussed from two different points of view

The Processing View
• What methods can be used in the production of statistics?
  – Specific statistical techniques for specific statistics
• Development of models of best practice or standards

The Reporting View
• What should quality reports look like?

Page 4: Assessing Quality for  Integration Based Data

Introduction – Reporting View

• Numerous formats for quality reports: SDSS, DQAF, Fed Stats, StaCan, …
• The proposals are organized according to so-called hyperdimensions
  – For example, ESS:
    • Institutional Arrangements
    • Core Statistical Processes
    • Dimensions for Statistical Output
  – Inside the hyperdimensions are so-called quality dimensions
    • Relevance, Accuracy, Timeliness, Accessibility, …

Page 5: Assessing Quality for  Integration Based Data

Introduction – Reporting View

• Not so much agreement about the dimensions
• Possible reason: different methods / levels of conceptualization
  – Concepts as mental entities
    • e.g. quality dimensions in DQAF
  – Concepts as meaning of general terms
    • e.g. quality elements in DQAF
  – Concepts as units of knowledge
    • e.g. quality indicators of DQAF
  – Concepts as abstracts of kinds, attributes or properties
    • measurable quantities like sampling error, …

Page 6: Assessing Quality for  Integration Based Data

Introduction – Reporting View

• Stronger matching of the processing and the reporting view seems necessary
  – Starting point can be attributes and properties of statistical processes necessary for assessing quality
• From basic quality concepts we build higher-level elements by aggregation
• Prerequisite for the definition of the necessary basic quality concepts:
  – Empirical analysis of different production processes
• Final result is a User Oriented Quality Certificate

Page 7: Assessing Quality for  Integration Based Data

Data Generating Processes

• We can distinguish two broad classes of data generating processes

– The survey based data generating process

– The integration based data generating process

Page 8: Assessing Quality for  Integration Based Data

Data Generating Processes – Survey based

• Most considerations about reporting quality start from the traditional survey process
  – Characteristics of the traditional survey process
    • One well defined target population (e.g. persons)
    • A rather homogeneous method for data collection (e.g. questionnaire)
    • A more or less linear sequence of processing steps (e.g. data cleaning, data editing, data imputation, output)
    • Final output is one output file

Page 9: Assessing Quality for  Integration Based Data

Data Generating Processes – Integration based

• Many statistics do not follow such a linear production scheme
  – Examples: indices, numerous balance sheets, National Accounts, …
• Common characteristic: data are produced from many different sources
• Let us call such processes integration based processes
• Data produced in such a way are called integration based data

Page 10: Assessing Quality for  Integration Based Data

Data Generating Processes – Integration based

– Characteristics of integration based data processing
  • Population:
    – The underlying population may be split into segments
      » Example: expenditures for education: government, private enterprises, households
    – Often more than one population is involved, possibly also one population at different times
      » Example: calculation of indices

Page 11: Assessing Quality for  Integration Based Data

Data Generating Processes – Integration based

– Characteristics of integration based data processing
  • Data collection:
    – Data collection is different for different segments and populations
    – Often the collected data are the output of already existing data products
  • Main processing activities are alignment procedures making the different sources comparable
  • Output may be a set of organized data files

Page 12: Assessing Quality for  Integration Based Data

Data Generating Processes – Workflow View

• Workflow for Survey Process

[Workflow diagram; boxes: Sampling Register, Sampling, Collect Survey Data, Additional Data, "Editing, Imputation, Transformation", Final Micro File, Final Tables]

Page 13: Assessing Quality for  Integration Based Data

Data Generating Processes – Workflow View

• Workflow for Integration Based Process

[Workflow diagram; boxes: Data Source 1, Data Source 2, Data Source 3, "Selection, Editing, Preparation" (three times), Integration by Matching, Integration 1, "Imputation, Computation", "Selection, Transformation", Integration by Merging, "Editing, Imputation, Transformation", Final Data Files, Output Table]
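A workflow of this kind can be recorded as a directed acyclic graph, which makes the processing order and the sources feeding each step explicit. The following is a minimal sketch only: the step names are taken from the slide, but the edges between them are an assumed and simplified reading of the diagram, and the helper `upstream_sources` is hypothetical. It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on.
# ASSUMPTION: these edges are an illustrative, simplified reading of the
# slide's diagram, not the authors' exact workflow.
workflow = {
    "prepare_source_1": {"data_source_1"},
    "prepare_source_2": {"data_source_2"},
    "prepare_source_3": {"data_source_3"},
    "integration_by_matching": {"prepare_source_1", "prepare_source_2"},
    "imputation_computation": {"integration_by_matching"},
    "integration_by_merging": {"imputation_computation", "prepare_source_3"},
    "editing_imputation_transformation": {"integration_by_merging"},
    "final_data_files": {"editing_imputation_transformation"},
    "output_table": {"final_data_files"},
}


def upstream_sources(graph, step):
    """Return the raw data sources (steps without predecessors) feeding `step`."""
    predecessors = graph.get(step, set())
    if not predecessors:
        return {step}
    sources = set()
    for predecessor in predecessors:
        sources |= upstream_sources(graph, predecessor)
    return sources


# A valid execution order: every step appears after all of its predecessors.
order = list(TopologicalSorter(workflow).static_order())
print(order)
print(sorted(upstream_sources(workflow, "integration_by_matching")))
```

Tracing the upstream sources of a step shows whose quality information is relevant when assessing that step's output.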

Page 14: Assessing Quality for  Integration Based Data

Data Quality for Integration Based Production

• Two important aspects of data quality
  – Content quality
    • Are the measured “concepts” really the target “concepts”?
  – Production quality
    • Are the methods used sound?

Page 15: Assessing Quality for  Integration Based Data

Data Quality for Integration Based Production – Content Quality

• Main reasons for lack of content quality
  – Slight differences in the measurement of the variables (“concepts”) when already existing data are reused
    – Example:
      » Transport of goods on Austrian rails
      » Transport of goods according to data from railway authorities (not taking into account that transport may partly use German rails)
  – Slight differences in the definition of the segments in the underlying population

Page 16: Assessing Quality for  Integration Based Data

Data Quality for Integration Based Production – Content Quality

• Conclusion: Using data already collected for other purposes often gives only proxy variables for the intended variables
• Question: Does this coincide with your mental concept of the term “Non-Sampling Error”?
• Manuals of international organizations are often rather vague with respect to such problems

Page 17: Assessing Quality for  Integration Based Data

Data Quality for Integration Based Production – Content Quality

• Possible strategies for solution
  – Statistical models for aligning the concepts
  – More detailed description of the concepts by using additional variables characterizing the differences as formal properties of the data
  – More detailed description of the underlying populations by using additional variables characterizing the differences

Page 18: Assessing Quality for  Integration Based Data

Data Quality for Integration Based Production – Processing Quality

• Elements of processing quality
  – Quality of methods used for the different components of the integration based statistic
    • This implies that we do not have one method of collection, one editing, one imputation, … but many activities of that kind
  – Quality of methods used in the integration process
    • Alignment of variables in order to overcome differences in concepts
    • Standard activities like plausibility checks, editing and imputation necessary for the integration activities

Page 19: Assessing Quality for  Integration Based Data

Assessing Quality for Integration Based Data

• If we know the quality of all the components used in the integration process, we have to think about the transmission of quality through the integration steps
• Starting point should be an “Authentic Data System”
  – All data used in the integration process
  – Quality information about the different data sets of the system

Page 20: Assessing Quality for  Integration Based Data

Assessing Quality for Integration Based Data

• Distinguish two types of quality transmission
  – Quality compilation
    • Methods for representing quality of the overall product
  – Quality calculations
    • Algorithms for assessing quality
• In both cases we need
  – Methods for assessing quality
  – Models of best practice / standards

Page 21: Assessing Quality for  Integration Based Data

Assessing Quality for Integration Based Data – Quality Compilations

• In some cases the best we can do is a better representation of the quality dimensions of the components used
  – Distribution of quality indicators
  – Concentration of quality indicators

Page 22: Assessing Quality for  Integration Based Data

Assessing Quality for Integration Based Data – Quality Compilations

– Example: Coverage for integration based data
  • Structure of integrated sources together with coverage information

[Diagram of integrated sources with coverage ratings]
  Source 1: coverage high
  Source 2: coverage high
  Source 3: coverage medium
  Source 4: coverage low
  Source 5: coverage very low
  Source 6: coverage high
  Source 7: coverage high

Page 23: Assessing Quality for  Integration Based Data

Assessing Quality for Integration Based Data – Quality Compilations

– Coverage distribution
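A coverage distribution of this kind could be tabulated from the per-source ratings given in the coverage example. The sketch below is only illustrative: the ratings are those listed for Sources 1 to 7, while the ordering of the levels and the function name `coverage_distribution` are assumptions.

```python
from collections import Counter

# Coverage ratings as listed for the seven integrated sources.
coverage = {
    "Source 1": "high",
    "Source 2": "high",
    "Source 3": "medium",
    "Source 4": "low",
    "Source 5": "very low",
    "Source 6": "high",
    "Source 7": "high",
}

# ASSUMPTION: an ordinal scale running from worst to best.
LEVELS = ["very low", "low", "medium", "high"]


def coverage_distribution(ratings):
    """Return {level: (count, share of sources)} for every coverage level."""
    counts = Counter(ratings.values())
    total = len(ratings)
    return {level: (counts[level], counts[level] / total) for level in LEVELS}


distribution = coverage_distribution(coverage)
for level in LEVELS:
    count, share = distribution[level]
    print(f"{level:>8}: {count} source(s), {share:.0%}")
```

Counting sources per level treats every source as equally important; weighting by each source's contribution is a separate question.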

Page 24: Assessing Quality for  Integration Based Data

Assessing Quality for Integration Based Data – Quality Compilations

– Coverage concentration with respect to target concept
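One possible reading of concentration with respect to the target concept is to ask how much of the target quantity comes from sources at each coverage level. The sketch below follows that reading only as an assumption. The coverage ratings are those of Sources 1 to 7, but the contribution shares are entirely hypothetical numbers chosen for illustration, and `concentration_by_level` is a hypothetical helper.

```python
# Coverage ratings as listed for the seven integrated sources.
coverage = {
    "Source 1": "high", "Source 2": "high", "Source 3": "medium",
    "Source 4": "low", "Source 5": "very low", "Source 6": "high",
    "Source 7": "high",
}

# HYPOTHETICAL: share of the target concept contributed by each source.
# These values are invented for illustration and sum to 1.
contribution = {
    "Source 1": 0.30, "Source 2": 0.25, "Source 3": 0.15,
    "Source 4": 0.10, "Source 5": 0.05, "Source 6": 0.10,
    "Source 7": 0.05,
}

# ASSUMPTION: levels ordered from best to worst coverage.
LEVELS_BEST_FIRST = ["high", "medium", "low", "very low"]


def concentration_by_level(ratings, shares, levels):
    """Return (level, share of target, cumulative share) rows, best level first."""
    rows = []
    cumulative = 0.0
    for level in levels:
        level_share = sum(
            shares[source] for source, rating in ratings.items() if rating == level
        )
        cumulative += level_share
        rows.append((level, level_share, cumulative))
    return rows


rows = concentration_by_level(coverage, contribution, LEVELS_BEST_FIRST)
for level, level_share, cumulative in rows:
    print(f"{level:>8}: {level_share:.2f} of target, cumulative {cumulative:.2f}")
```

With these invented shares, most of the target quantity would come from high-coverage sources, which is the kind of statement a concentration view is meant to support.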

Page 25: Assessing Quality for  Integration Based Data

Assessing Quality for Integration Based Data – Quality Calculations

• Methods will in most cases not be formulas but advanced statistical procedures for different quality dimensions
  – Examples:
    • Measurement of accuracy using variances, standard errors or coefficients of variation
      – Could be done by using the bootstrap (e.g. applied for indices by NSO-GB)
    • Simulation techniques
    • Sensitivity analysis (“robustness”)
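As an illustration of the bootstrap idea, the sketch below estimates the standard error and coefficient of variation of a simple ratio index by resampling items with replacement. It is a minimal sketch under stated assumptions: the index formula, the sample values and the function names are hypothetical, and it does not reproduce the procedure of any statistical office.

```python
import random
import statistics


def ratio_index(pairs):
    """Simple ratio index: 100 * sum(current values) / sum(base values)."""
    base_total = sum(base for base, _ in pairs)
    current_total = sum(current for _, current in pairs)
    return 100.0 * current_total / base_total


def bootstrap_index(pairs, n_boot=2000, seed=0):
    """Return (estimate, bootstrap standard error, coefficient of variation)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(pairs)
    replicates = []
    for _ in range(n_boot):
        # Resample the items with replacement and recompute the index.
        resample = [pairs[rng.randrange(n)] for _ in range(n)]
        replicates.append(ratio_index(resample))
    estimate = ratio_index(pairs)
    std_error = statistics.stdev(replicates)
    return estimate, std_error, std_error / estimate


# HYPOTHETICAL (base value, current value) pairs for eight items.
sample = [
    (10.0, 11.0), (20.0, 21.5), (15.0, 15.3), (30.0, 33.0),
    (25.0, 26.0), (12.0, 13.1), (18.0, 18.2), (22.0, 24.0),
]

estimate, std_error, cv = bootstrap_index(sample)
print(f"index = {estimate:.2f}, SE = {std_error:.2f}, CV = {cv:.4f}")
```

For an integration based statistic the resampling would have to respect how the sources were combined (for example, resampling within each source), which this single-sample sketch does not attempt.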

Page 26: Assessing Quality for  Integration Based Data

Conclusions

• Assessing quality of integration based statistics needs
  – Clear separation of content based quality and processing based quality
  – Better documentation / representation of complex production processes, usage of workflow models
  – Documentation of the authentic data file
  – Definition of best practice / standards for integration processes
  – Algorithms for calculating quality dimensions
  – Methods for representation of quality indicators