DATA LINKING ASPECTS OF COMBINING DATA INCLUDING OPTIONS FOR VARIOUS HIERARCHIES (S-DWH CONTEXT)

Nadežda Fursova, Chief specialist

ESSnet DWH - Workshop IV, 20-21 March 2013

www.stat.gov.lt

The main topics of the presentation

What is data linking?

Input data set

Data linkage methods

Problems we meet linking data

Data linking and data integration

Linking different input sources (administrative data, survey data, etc.) to one population.

(= data linking)

In the next step, the linked data are processed into one consistent dataset, which greatly increases the power of analysis possible with the data.

(= data integration)

Opportunities:
• reducing costs
• improving quality

Challenges:
• preparatory work to examine the data
• linking is normally easy if there is a unique ID, but some cases remain unlinkable

Types of data linking

Record linkage for organizing ONE dataset:
• data cleaning
• removing duplicates

Record linkage for merging TWO or MORE datasets:
• merging data into one consistent dataset

Input data set

The first step in data linkage is to determine needs and check data availability.

Proposed scope of input data:

Statistical Business Register and population frame

To link several input data sources in an SDWH, we need to agree on the default target population and on the enterprise unit to which all input data are matched.

The default target population is defined as statistical enterprise units which have been active during the reference year.

The population frame is the backbone input source. It includes the following information:
• frame reference year
• statistical enterprise unit, including its national ID and its EGR ID
• name/address of the enterprise
• national ID of the enterprise
• date in population (mm/yr)
• date out of population (mm/yr)
• NACE code
• institutional sector code
• size class

The population frame is crucial information for determining the default active population.
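
As a minimal sketch of how the default active population could be derived from the frame, the Python below keeps the units whose time in the population overlaps the reference year. The `FrameUnit` class, its field names and the rule that the month of exit still counts as active are illustrative assumptions, not part of the presentation.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class FrameUnit:
    # Hypothetical, reduced frame record; a real frame carries more variables.
    national_id: str
    date_in: date                    # month the unit entered the population (day set to 1)
    date_out: Optional[date] = None  # month the unit left; None if still in the population


def is_active(unit: FrameUnit, reference_year: int) -> bool:
    """True if the unit was in the population at some point in the reference year."""
    entered_in_time = unit.date_in <= date(reference_year, 12, 31)
    # Assumption: the month of exit still counts as an active month.
    still_present = unit.date_out is None or unit.date_out >= date(reference_year, 1, 1)
    return entered_in_time and still_present


def active_population(frame: List[FrameUnit], reference_year: int) -> List[FrameUnit]:
    """Select the default active population for one reference year."""
    return [unit for unit in frame if is_active(unit, reference_year)]
```

A unit that left the population in December of the previous year is excluded, while a unit that left in January of the reference year is kept.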

Data sources

One aim of an SDWH is to create a set of fully integrated data about enterprises. These data may come from different sources such as surveys, administrative data, accounting data and census data. Different data sources cover different populations.

The main issue is to link these input data sources while ensuring that the data are linked to the same enterprise unit and compared with the same target population.

Main data sources:
• surveys (censuses, sample surveys)
• combined data (survey and administrative data)
• administrative data

Defining metadata

The term metadata is very broad. A distinction is made between “structural” metadata that define the structure of statistical data sets and metadata sets, and “reference” metadata describing actual metadata contents.

NSIs need to define metadata before linking sources

What kind of reference metadata needs to be submitted?

ESMS metadata files are used to describe the statistics released by Eurostat. They aim to document methodologies, quality and the statistical production processes in general.

The statistical unit base

The unit base is closely related to the SBR. Its contents are also closely related to the available input data, and it was recommended to consider it as a separate input source.

The unit base describes the relationship between the different units and the statistical enterprise unit.

[Figures: example of the Netherlands unit base; example of Lithuania]

Data linking methods

Data linkage methods usually fall along a spectrum between two approaches:

Deterministic linkage – methods involve exact one-to-one character matching of linkage variables.

Probabilistic linkage – methods involve the calculation of linkage weights estimated given all the observed agreements and disagreements of the data values of the matching variables.

A combination of linkage methods may be used, but the choice of method depends on the types and quality of linkage variables available on the data sets to be linked.

Deterministic linkage

• the simplest method of matching (sort/merge)
• based on exact matches; the variables used need to be accurate, robust, stable over time and complete
• works best if there is a common unique identifier (company ID number, Social Security number, etc.)

When there is no unique identifier (= not ideal):

• use a statistical linkage key (SLK), generally a combination of attributes such as last name, first name, sex and date of birth
• stepwise deterministic record linkage: a more sophisticated form of deterministic linkage, developed in response to the variations that often exist in the attributes used to create SLKs
• “rules-based linkage”: a set of rules is used to classify pairs of records as matches or non-matches
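
The exact-match and stepwise ideas above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (dictionaries with a `rec` field), the function names and the rule of accepting only unambiguous matches are hypothetical choices, not something the presentation prescribes.

```python
def exact_link(records_a, records_b, key_fields):
    """One deterministic pass: link records whose key fields agree exactly."""
    index = {}
    for b in records_b:
        key = tuple(b.get(field) for field in key_fields)
        if None not in key:                      # incomplete keys cannot be matched
            index.setdefault(key, []).append(b)

    links, unlinked = [], []
    for a in records_a:
        key = tuple(a.get(field) for field in key_fields)
        candidates = index.get(key, []) if None not in key else []
        if len(candidates) == 1:                 # accept unambiguous matches only
            links.append((a["rec"], candidates[0]["rec"]))
        else:
            unlinked.append(a)
    return links, unlinked


def stepwise_link(records_a, records_b, passes):
    """Run exact passes from the strictest key to the loosest one.

    Records linked in an earlier pass are removed before the next pass.
    """
    all_links = []
    remaining_a, remaining_b = list(records_a), list(records_b)
    for key_fields in passes:
        links, remaining_a = exact_link(remaining_a, remaining_b, key_fields)
        linked_b = {b_rec for _, b_rec in links}
        remaining_b = [b for b in remaining_b if b["rec"] not in linked_b]
        all_links.extend(links)
    return all_links, remaining_a
```

A typical call would pass `passes=[("company_id",), ("name", "postcode")]`, so that the unique identifier is tried first and a combination of attributes is used only for the records that are still unlinked.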

Deterministic linkage

Statistical linkage keys:

• most SLKs for person statistics are constructed from last name, first name, sex and full date of birth
• SLKs protect privacy and data confidentiality because they serve as an alternative to a person’s name and date of birth appearing on the data sets to be linked

There are two kinds of errors associated with SLKs:

• data items on an individual’s record may be incomplete or missing, which means that the SLK will be incomplete
• errors in the source data may lead to multiple SLKs being generated for the same individual, or to multiple individuals sharing the same SLK

Problems:

• often there is no unique, known and accurate ID
• poor-quality data (errors, variations, missing data, etc.)
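
To make the two kinds of SLK error concrete, here is a small sketch of a key built from name letters, date of birth and sex. The key layout (first three letters of the last name, first two of the first name, date of birth as digits, one-letter sex code, `?` for missing characters) is a hypothetical format chosen for illustration; it is not a standard key and not the one used in the presentation.

```python
from typing import Optional, Sequence


def _letters(name: Optional[str], positions: Sequence[int]) -> str:
    """Pick letters at the given positions; '?' marks a missing letter."""
    cleaned = "".join(ch for ch in (name or "").upper() if ch.isalpha())
    return "".join(cleaned[p] if p < len(cleaned) else "?" for p in positions)


def make_slk(last_name: Optional[str], first_name: Optional[str],
             sex: Optional[str], birth_date: Optional[str]) -> str:
    """Build an illustrative linkage key; birth_date is expected as 'YYYY-MM-DD'."""
    dob = birth_date.replace("-", "") if birth_date else "?" * 8
    sex_code = (sex or "?")[0].upper()
    return _letters(last_name, (0, 1, 2)) + _letters(first_name, (0, 1)) + dob + sex_code


# Two different people can share one key (second kind of error).
key_1 = make_slk("Jonaitis", "Jonas", "M", "1980-05-17")
key_2 = make_slk("Jonušas", "Jokūbas", "M", "1980-05-17")

# Missing data items give an incomplete key (first kind of error).
key_3 = make_slk("Li", None, "F", None)
```

Here `key_1` and `key_2` are identical although the persons differ, and `key_3` consists mostly of `?` placeholders, so it cannot be matched reliably.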

Probabilistic linkage

may be undertaken where there are no unique identifiers or SLKs

or where the linkage variables and/or entity identifiers are not as accurate, stable or complete as required for the deterministic method

can lead to much better linkage than simple deterministic linkage methods

has a greater capacity to link records with errors in their linking variables

Probabilistic linkage

M-probability (match probability):

probability that a field agrees given that the pair of records is a true match

for any given field, the same M-probability applies for all records

U-probability (unmatch probability):

probability that a field agrees given that the pair of records is not a true match.

it is often simplified as the chance that the field agrees in two records purely at random
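
The slides do not say how M- and U-probabilities are turned into linkage weights. One common choice, usually attributed to Fellegi and Sunter, is a log-likelihood ratio per field: log2(m/u) when the field agrees and log2((1-m)/(1-u)) when it disagrees, summed over the fields. The sketch below assumes that formula; the field names and probability values in the usage note are made up.

```python
import math


def field_weight(agrees: bool, m: float, u: float) -> float:
    """Weight of one field: positive on agreement, negative on disagreement."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def pair_weight(agreements: dict, m_probs: dict, u_probs: dict) -> float:
    """Total weight of a record pair: the sum of its field weights."""
    return sum(
        field_weight(agrees, m_probs[field], u_probs[field])
        for field, agrees in agreements.items()
    )
```

For example, with hypothetical values m = 0.9 and u = 0.1 for a name field, agreement contributes about +3.17 and disagreement about -3.17. A pair whose total weight exceeds a chosen upper threshold is treated as a match, and one below a lower threshold as a non-match.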

Summary: deterministic and probabilistic linking

Ideal situation – availability of unique ID

Simplest and fastest method

Best quality

( = deterministic linking)

For an SDWH, a unique ID is desired for the most important data sources (otherwise the work will be too laborious).

If there is no unique ID for some less important data sources, several deterministic and probabilistic linking techniques (as presented before) can be used.

The data linkage process

The data linkage process may vary, depending on the linkage model and the linkage method.

However, four steps are common to both data linkage models:

Data cleaning and standardization

Blocking (in case of large datasets)

Record pair comparison

Decision model

Determinants of linkage quality:
• the quality of SLKs (in the case of deterministic linkage)
• the quality of blocking and linkage variables (in the case of probabilistic linkage)

Poor quality of variables can lead to some records not being linked or being linked to wrong records.
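
The four common steps can be sketched end to end as below. Everything specific here is an assumption made for illustration: the record fields (`id`, `name`, `postcode`), the postcode as blocking key, token overlap as the comparison measure and the 0.5 threshold as the decision rule.

```python
from collections import defaultdict
from itertools import product


def clean(record):
    """Step 1: cleaning and standardization of name and postcode."""
    name = record["name"].upper().replace(".", " ").replace(",", " ")
    return {
        "id": record["id"],
        "name": " ".join(name.split()),
        "postcode": record["postcode"].replace(" ", "").upper(),
    }


def block(records):
    """Step 2: blocking, so that only records sharing a postcode are compared."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record["postcode"]].append(record)
    return blocks


def compare(a, b):
    """Step 3: record pair comparison (share of name tokens in common)."""
    tokens_a, tokens_b = set(a["name"].split()), set(b["name"].split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def link(source_a, source_b, threshold=0.5):
    """Step 4: decision model, here a simple threshold on the comparison score."""
    blocks_a = block([clean(r) for r in source_a])
    blocks_b = block([clean(r) for r in source_b])
    links = []
    for key, records_a in blocks_a.items():
        for a, b in product(records_a, blocks_b.get(key, [])):
            if compare(a, b) >= threshold:
                links.append((a["id"], b["id"]))
    return links
```

Blocking keeps the number of comparisons manageable for large datasets, at the price of missing true matches whose blocking key is wrong, which is one way poor variable quality leads to unlinked records.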

Measures of quality of data linkage

Measures that may be used to assess data linkage quality include:

accuracy

sensitivity

specificity

precision

false-positive
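
Assuming the true match status of each compared pair is known (for example from a clerically reviewed sample), these measures can be computed from the four counts of a confusion table, as in the sketch below. The function name is hypothetical, and reading "false-positive" as the false-positive rate is an interpretation of the slide.

```python
def linkage_quality(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Quality measures from pair counts.

    tp: true matches that were linked      fp: non-matches that were linked
    fn: true matches that were not linked  tn: non-matches that were not linked
    """
    def ratio(numerator, denominator):
        # None signals that the measure is undefined for these counts.
        return numerator / denominator if denominator else None

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "false_positive_rate": ratio(fp, fp + tn),
    }
```

Because non-matching pairs usually far outnumber matching ones, accuracy and specificity tend to look high even for a poor linkage, so sensitivity and precision are often the more informative figures.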

Conclusions

For successful data linking:
• the populations of the different data sources should be known
• the input sources should be of high quality
• a unique identifier is desired

If there is no unique identifier:
• different methods can be applied (deterministic and probabilistic)

The quality of data linkage depends on:
• the presence of a unique ID
• the accuracy and precision of the data, and the false-positive rate when linking

Data linking and data integration in the next steps:
• the challenge is to deal with errors, missing data and conflicting data (presentations tomorrow)
