Micro integration

Combining data from different modes and sources

ENP course

[email protected]

Learning outcomes

• Greater knowledge of micro integration

• Identify the key steps in the micro integration process

• Appreciate some specific techniques which may be applied in micro integration

What is micro integration?

“Micro integration involves matching data from statistical units at an individual level, with the goal of compiling better information than is possible when using the separate sources.” – Bart Bakker

Why?

Survey sampling paradigm:

• Use as sampling frame

• Reduce sampling error

– Post-stratification

– Ratio/regression estimation

• Reduce non-response error

– Response enhancement during data collection

– Statistical adjustment after data collection

Why?

Beyond…

• Use as target data

– New statistics

– Reduce costs

• Harmonisation of statistics

• Small population groups

• Small area estimation

• Quality measures

Italian Integrated System of Statistical Registers (ISSR)

• Integrated System of Statistical Registers

– single logical environment to support the consistency of statistical production processes and improve outputs for users

– in particular, consistency in “identification” and “estimation” of units and variables for the system as a whole

• New analyses starting from populations in registers

Registers: Population Register, Territory Register, Business Register, Activity Register

Statistics New Zealand - IDI

Micro integration process

• Harmonisation of units

• Harmonisation of reference periods

• Completion of populations

• Harmonisation of variables

• Harmonisation of classifications

• Adjusting for measurement errors

• Adjusting for missing data

• Derivation of variables

• Overall consistency

Paul van der Laan (2000)

Harmonisation of units

Example of wages in LFS

• Labour Force Survey (LFS) – large panel survey of people measuring the labour force.

• Statistical unit: person aged 15-74

• Variable: gross monthly earnings from the employee's main job

• 6 countries integrate this variable from registers (DK, ES, NL, AT, SI, SE)

Norwegian wages integration into LFS

• Employers must report information on wages for all workers to a central system every month

• Each employee is identified by a personal identity number

• Statistical unit: work relation/job

Harmonisation of units

Labour Force Survey vs. Jobs register

Labour Force Survey: multiple job holders decide for themselves which job is to be considered as the main job. In doubtful cases the main job should be the one with the greatest number of hours usually worked.

Harmonisation of units – main job

1. Compulsory service

2. Ordinary job

• Longest hours

• Held longest

• Greatest income

3. Central Tax Office for foreign affairs

• Greatest income

• Held longest

4. Freelancer

• Longest hours

• Held longest

• Greatest income

5. Other

Labour Force Survey Norwegian Jobs register
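The priority list above can be read as a sort order: job type first, then the listed tie-breakers. The following is a minimal sketch of that reading, not the production rule set. The job-type codes, the `Job` fields and the tie-breakers used for types 1 and 5 (for which the slide lists none) are assumptions made for illustration.

```python
from dataclasses import dataclass

# Priority of job types, following the numbered list above (1 = highest).
# The string codes are hypothetical.
TYPE_PRIORITY = {
    "compulsory_service": 1,
    "ordinary": 2,
    "foreign_tax_office": 3,
    "freelance": 4,
    "other": 5,
}


@dataclass
class Job:
    person_id: str
    job_id: str
    job_type: str      # one of the TYPE_PRIORITY keys
    hours: float       # hours worked in the reference period
    tenure_days: int   # how long the job has been held
    income: float      # income from the job in the reference period


def main_job_key(job):
    """Sort key: the smallest tuple wins, so 'greatest' values are negated."""
    rank = TYPE_PRIORITY.get(job.job_type, TYPE_PRIORITY["other"])
    if job.job_type == "foreign_tax_office":
        # The slide lists only income and tenure for this type.
        tie_breakers = (-job.income, -job.tenure_days)
    else:
        # Hours, then tenure, then income. Assumed for types 1 and 5 too.
        tie_breakers = (-job.hours, -job.tenure_days, -job.income)
    return (rank, tie_breakers)


def select_main_jobs(jobs):
    """Return {person_id: main Job} for a list of register jobs."""
    main = {}
    for job in jobs:
        current = main.get(job.person_id)
        if current is None or main_job_key(job) < main_job_key(current):
            main[job.person_id] = job
    return main


jobs = [
    Job("A", "a1", "ordinary", 20.0, 400, 15000.0),
    Job("A", "a2", "ordinary", 37.5, 100, 30000.0),
    Job("A", "a3", "freelance", 40.0, 900, 50000.0),
    Job("B", "b1", "ordinary", 37.5, 2000, 40000.0),
    Job("B", "b2", "compulsory_service", 0.0, 30, 0.0),
]
main_jobs = select_main_jobs(jobs)
print({pid: job.job_id for pid, job in main_jobs.items()})
```

Person A keeps the ordinary job with the longest hours even though the freelance job has more hours, because job type takes precedence over the tie-breakers.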

Harmonisation of reference periods

• Fortnightly payroll, monthly statistics

• Environmental monitoring and health outcomes

Smoothing splines
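As a simpler alternative to smoothing splines, fortnightly payroll amounts can be allocated to calendar months pro rata by day. The sketch below assumes pay is earned evenly over each pay period. The function name and the input layout are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def allocate_to_months(pay_periods):
    """Spread each pay period's amount evenly over its days, then sum by month.

    pay_periods: iterable of (start_date, end_date_inclusive, amount).
    Returns {(year, month): allocated amount}.
    """
    monthly = defaultdict(float)
    for start, end, amount in pay_periods:
        n_days = (end - start).days + 1
        daily = amount / n_days
        day = start
        while day <= end:
            monthly[(day.year, day.month)] += daily
            day += timedelta(days=1)
    return dict(monthly)


# A fortnight that straddles January and February 2024 (illustrative figures).
periods = [(date(2024, 1, 22), date(2024, 2, 4), 1400.0)]
monthly_pay = allocate_to_months(periods)
print(monthly_pay)
```

Ten of the fourteen days fall in January, so 10/14 of the amount is assigned to January and the rest to February.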

Central population register

Person ID   Household ID   Dwelling ID
001         1              H010557
002         1              H010557
003         2              H022588

Address and building register

Dwelling ID   Owner ID   Address
H010557       001        Akersveien 26
H022588       003        Akersveien 26

Central population register

Person ID   Household ID   Dwelling ID
001         1              H010557
002         1              H010755
003         3              H022589

Address and building register

Dwelling ID   Owner ID   Address
H010557       001        Akersveien 26
H010758       002        Akersveien 26
H022588       005        Akersveien 26

?   ?   ?
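The inconsistent example can be checked mechanically. The sketch below is a hedged illustration using the rows from the second pair of tables. It links persons to dwellings on Dwelling ID, lists persons whose dwelling is not found in the address and building register, and flags households whose members point to different dwellings. The dictionary layout and function names are assumptions.

```python
from collections import defaultdict

# Rows from the inconsistent example (second pair of tables).
persons = [
    {"person_id": "001", "household_id": "1", "dwelling_id": "H010557"},
    {"person_id": "002", "household_id": "1", "dwelling_id": "H010755"},
    {"person_id": "003", "household_id": "3", "dwelling_id": "H022589"},
]
dwellings = {
    "H010557": {"owner_id": "001", "address": "Akersveien 26"},
    "H010758": {"owner_id": "002", "address": "Akersveien 26"},
    "H022588": {"owner_id": "005", "address": "Akersveien 26"},
}


def link_persons_to_dwellings(persons, dwellings):
    """Exact-match link on Dwelling ID; returns (linked pairs, unlinked person IDs)."""
    linked, unlinked = [], []
    for p in persons:
        if p["dwelling_id"] in dwellings:
            linked.append((p["person_id"], p["dwelling_id"]))
        else:
            unlinked.append(p["person_id"])
    return linked, unlinked


def households_split_across_dwellings(persons):
    """Households whose members are registered on more than one dwelling."""
    by_household = defaultdict(set)
    for p in persons:
        by_household[p["household_id"]].add(p["dwelling_id"])
    return {h: sorted(d) for h, d in by_household.items() if len(d) > 1}


linked, unlinked = link_persons_to_dwellings(persons, dwellings)
split = households_split_across_dwellings(persons)
print(linked, unlinked, split)
```

Only person 001 links directly. Persons 002 and 003 need another method to be assigned a dwelling.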

Central population register: household type, household size, average age – adults, average age – children, building type

Address and building register: address, building type, living area

85% (Norway 2011)

(Single) Nearest Neighbour Imputation

• Select a donor unit which is considered to be similar to the unit requiring imputation, and donate the response from that unit to the unit requiring a response.

• Requires additional variables to calculate a distance measure matrix between the donors and recipients

Household ID   Average age of adults   Household size   Dwelling number
001            30                      2                ?
002            25                      2                H02033
003            45                      1                H01038
004            40                      3                H02078
005            55                      4                B01093

Distance matrix between households (Euclidean distance on average age of adults and household size):

       001     002     003     004     005
001    0
002    5       0
003    15.03   20.02   0
004    10.05   15.03   5.39    0
005    25.08   30.07   10.44   15.03   0

(Single) Nearest Neighbour Linkage

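Euclidean distance: $d(i, j) = \sqrt{\sum_k (x_{ik} - x_{jk})^2}$, where $x_{ik}$ is the value of matching variable $k$ for household $i$

Nearest donor for household 001: household 002 (distance 5), so the donated dwelling number is H02033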
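The worked example can be reproduced in a few lines. This is a minimal sketch using the five households from the table and unscaled Euclidean distance on the two matching variables. The dictionary keys are illustrative names.

```python
import math

# Households from the example table. Household 001 lacks a dwelling number.
households = {
    "001": {"avg_age_adults": 30, "size": 2, "dwelling": None},
    "002": {"avg_age_adults": 25, "size": 2, "dwelling": "H02033"},
    "003": {"avg_age_adults": 45, "size": 1, "dwelling": "H01038"},
    "004": {"avg_age_adults": 40, "size": 3, "dwelling": "H02078"},
    "005": {"avg_age_adults": 55, "size": 4, "dwelling": "B01093"},
}


def distance(a, b):
    """Euclidean distance on average adult age and household size."""
    return math.hypot(a["avg_age_adults"] - b["avg_age_adults"],
                      a["size"] - b["size"])


def nearest_donor(recipient_id, units):
    """ID of the closest unit that has a dwelling number to donate."""
    recipient = units[recipient_id]
    donors = {k: v for k, v in units.items() if v["dwelling"] is not None}
    return min(donors, key=lambda k: distance(recipient, donors[k]))


donor_id = nearest_donor("001", households)
imputed_dwelling = households[donor_id]["dwelling"]
print(donor_id, imputed_dwelling)  # 002 H02033
```

Because the variables are not rescaled, age differences dominate the distance. Standardising the matching variables first is a common design choice when they are on very different scales.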

Household (Central population register): household type, household size, average age – adults, average age – children, building type

Dwelling (Address and building register): address, building type, living area

Double Nearest Neighbour Linkage

Implementation – Norwegian census 2011

Norwegian employment in National accounts (KNR), Labour Force Survey (AKU) and register-based

Enterprise surveys example

• The Labour Cost Survey (LCS), concerning structural statistics on earnings and on labour costs.

• The Continuing Vocational Training Survey (CVTS), concerning statistics relating to vocational training in enterprises.

• Register of enterprises

Harmonisation of variables

Register of enterprises

• All enterprises

• Number of employees

Labour Cost Survey (LCS)

• Survey of enterprises with >= 10 employees

• Number of employees includes homeworkers if there is an explicit agreement that the homeworker is remunerated on the basis of the work done and they are included on the pay-roll.

Continuing vocational training survey (CVTS)

• Survey of enterprises with >= 10 employees

• Number of employees includes unpaid family workers

Harmonisation of variables

Labour Cost Survey (LCS)

• Coverage of costs of CVT courses and other forms of training

• Includes costs associated with apprentices

Continuing vocational training survey (CVTS)

• Coverage of costs of CVT courses only

• Excludes costs for apprentices

Variable names

                       Same name/label    Different name/label
Same definition        Same value         Same value
Different definition   Different value    Different value

Harmonisation of classifications

Register of enterprises

• All enterprises

• Number of employees: exact number

Labour Cost Survey (LCS)

• Survey of enterprises with >= 10 employees

• Number of employees: 10-49, 50-249, 250-499, 500-999, 1000+

Continuing vocational training survey (CVTS)

• Survey of enterprises with >= 10 employees

• Number of employees: 10-49, 50-249, 250+
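One way to harmonise the three size classifications is to map each source to the coarsest one they share (10-49, 50-249, 250+). The sketch below illustrates that choice and is not a prescribed method. The function and mapping names are hypothetical.

```python
# Common (coarsest) size classes shared by the three sources.
COMMON_CLASSES = ["10-49", "50-249", "250+"]

# The finer LCS classes collapse into the common classes.
LCS_TO_COMMON = {
    "10-49": "10-49",
    "50-249": "50-249",
    "250-499": "250+",
    "500-999": "250+",
    "1000+": "250+",
}


def class_from_exact(n_employees):
    """Map an exact register count to a common class (None below the survey threshold)."""
    if n_employees < 10:
        return None
    if n_employees <= 49:
        return "10-49"
    if n_employees <= 249:
        return "50-249"
    return "250+"


def harmonise(source, value):
    """Return the common size class for a value from one of the three sources."""
    if source == "register":
        return class_from_exact(value)
    if source == "LCS":
        return LCS_TO_COMMON[value]
    if source == "CVTS":
        if value not in COMMON_CLASSES:
            raise ValueError(f"unknown CVTS size class: {value}")
        return value
    raise ValueError(f"unknown source: {source}")


print(harmonise("register", 312), harmonise("LCS", "500-999"), harmonise("CVTS", "250+"))
```

Collapsing to the coarsest classification loses detail from the register and the LCS, but it avoids having to model how CVTS enterprises are distributed within the 250+ class.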

Exercises – micro integration

• The harmonisation of variables and classifications is an important part of micro integration.

1. Describe an example of when the harmonisation of variables or classifications was done well in your organisation. What was the outcome?

2. Do you have any examples of when this wasn’t done optimally? What was the outcome?

Measurement error

• Difference between the actual value of a quantity and the value obtained by a measurement. Repeating the measurement and averaging the results reduces the random error (caused by the accuracy limit of the measuring instrument) but not the systematic error (caused by incorrect calibration of the measuring instrument).

World estimates of maternal mortality ratio

Number of maternal deaths per 100 000 live births

Proportion of deaths among women of reproductive age due to maternal causes

$$\mathrm{MMR} = \frac{\text{Number of maternal deaths}}{\text{Number of live births}} \cdot 100\,000$$

• Country and world region estimates

• Denominator: UNPD estimates

Maternal mortality data sources

• Maternal mortality data can come from a variety of sources:

– Vital registration

– Household surveys (sisterhood method)

– Censuses

– Reproductive-age mortality studies (RAMOS)

– Verbal autopsy

Measurement errors in vital registration

Vital registration:

• Country/year-specific adjustment factors used if available

• Adjustment factor of 1.5 used if not

Survey/census data

• Underreporting of maternal deaths (10% adjustment)

• Over-reporting when all pregnancy deaths are reported (10%-15% adjustment)
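The vital registration adjustment can be illustrated with a small calculation. This is a sketch under stated assumptions: the adjustment is applied as a multiplier to reported maternal deaths, the default of 1.5 is the fallback named above, and the input figures are invented for illustration only.

```python
DEFAULT_VR_FACTOR = 1.5  # fallback when no country/year-specific factor exists


def adjust_vital_registration(reported_deaths, factor=None):
    """Scale reported maternal deaths by an adjustment factor (assumed multiplicative)."""
    return reported_deaths * (DEFAULT_VR_FACTOR if factor is None else factor)


def maternal_mortality_ratio(maternal_deaths, live_births):
    """Maternal deaths per 100 000 live births."""
    return maternal_deaths * 100_000 / live_births


# Invented figures: 40 registered maternal deaths, 50 000 live births.
adjusted_deaths = adjust_vital_registration(40)
mmr_raw = maternal_mortality_ratio(40, 50_000)
mmr_adjusted = maternal_mortality_ratio(adjusted_deaths, 50_000)
print(mmr_raw, mmr_adjusted)
```

With these figures the unadjusted ratio is 80 and the adjusted ratio is 120 per 100 000 live births, which shows how strongly the estimate depends on the choice of factor.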

Imputation

• Many techniques

– Donor imputation: hot deck, cold deck, nearest neighbour, historic

– Explicit model: average, regression

• Single and multiple imputation

• Stochastic

• Restrictions

• Multivariate and univariate

Derivation of variables - employment

• Integration of:

– Micro data from a-ordningen

– Business register

– Central population register

– Unemployment register

– Compulsory military service register

– SFU – overseas tax office

– Illness - doctor register

– Education database

– Immigration data

Derivation of variables - employment

• Start and stop dates indicate person is in work and:

– Received income in the current period or

– Completing compulsory service or

– Received governmental welfare indicating temporary absence (sick pay, maternity leave etc) or

– Registered as temporarily laid off or on permitted leave (<90 days) or

– The start date for work is very recent or

– Received income in the previous month and next month
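The derivation rule above can be written as a single boolean function. This is a sketch under assumptions: the listed conditions are treated as alternatives joined by "or", and all field names are hypothetical rather than taken from the actual registers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonMonth:
    """Register indicators for one person in one reference month (hypothetical fields)."""
    has_active_job_spell: bool                 # start/stop dates cover the month
    income_current: bool = False               # received income in the current period
    compulsory_service: bool = False           # completing compulsory service
    temporary_absence_benefit: bool = False    # sick pay, maternity leave etc.
    laid_off_or_leave_days: Optional[int] = None  # days laid off or on permitted leave
    job_started_recently: bool = False         # start date for work is very recent
    income_previous_and_next: bool = False     # income in previous and next month


def is_employed(p):
    """Derived employment status: an active job spell plus at least one supporting signal."""
    if not p.has_active_job_spell:
        return False
    short_layoff = (p.laid_off_or_leave_days is not None
                    and p.laid_off_or_leave_days < 90)
    return any([
        p.income_current,
        p.compulsory_service,
        p.temporary_absence_benefit,
        short_layoff,
        p.job_started_recently,
        p.income_previous_and_next,
    ])


on_sick_pay = PersonMonth(has_active_job_spell=True, temporary_absence_benefit=True)
long_layoff = PersonMonth(has_active_job_spell=True, laid_off_or_leave_days=120)
print(is_employed(on_sick_pay), is_employed(long_layoff))  # True False
```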

Overall consistency

• Obvious errors:

– Logical constraints

  • Pregnant males

  • Parts don’t add to the sum

– Unreasonable values

  • Negative wages or number of employees

– Extreme values

• Probable errors:

– All jobs together should equate to < 160% workload

– Statistical distributions: regression controls, quartile methods
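The checks above translate naturally into edit rules. The sketch below is illustrative only: the record keys are hypothetical, the 160% limit is the one named above, and the quartile method is implemented as the usual 1.5 × IQR fence, which is an assumption about what "quartile methods" refers to.

```python
import statistics


def obvious_errors(record):
    """Hard edits: violations of logical constraints and impossible values."""
    errors = []
    if record.get("sex") == "male" and record.get("pregnant"):
        errors.append("pregnant male")
    parts, total = record.get("parts"), record.get("total")
    if parts is not None and total is not None and abs(sum(parts) - total) > 1e-9:
        errors.append("parts do not add to the sum")
    if record.get("wages", 0) < 0:
        errors.append("negative wages")
    if record.get("employees", 0) < 0:
        errors.append("negative number of employees")
    return errors


def probable_errors(record, max_workload=160.0):
    """Soft edit: total workload across all jobs should stay below the limit."""
    workload = sum(record.get("job_workloads", []))
    return ["workload of all jobs is 160% or more"] if workload >= max_workload else []


def quartile_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


record = {"sex": "male", "pregnant": True, "parts": [10, 20], "total": 35,
          "wages": -500, "employees": 12, "job_workloads": [100, 80]}
print(obvious_errors(record))
print(probable_errors(record))
print(quartile_outliers([10, 11, 12, 13, 14, 15, 100]))
```

Obvious errors can usually be corrected or rejected automatically, whereas probable errors are typically flagged for review.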

Exercises – Measurement error

Scenario:

The Rental Market Survey provides estimates for the average monthly rents of rental housing by district. The statistical unit is the dwelling, while individual people living in those dwellings are asked to provide monthly rental price details. It is a panel survey in which sampled units are asked every month for 13 months.

Your organisation has recently switched the Rental Market Survey from a CATI data collection to a CAWI collection mode. The new estimates show rental property prices have gone up significantly in the most recent period. You suspect it could be due to a measurement error caused by mode effects.

1. What technique(s) could you use to investigate and measure a measurement error? (Given unlimited resources)

2. How could the micro integration of survey data with other data sources be used to investigate measurement error and improve estimation?

References

Alkema, L., Zhang, S., Chou, D., et al. (2015) A Bayesian approach to the global estimation of maternal mortality.

Bakker, B. (2011) Micro Integration. Statistics Netherlands, The Hague.

HLG-MOS (2017) DRAFT A guide to data integration for Official Statistics.

Van der Laan, P. (2000) Integrating administrative registers and household surveys. Netherlands Official Statistics, 2.

WHO (2015) Trends in maternal mortality: 1990 to 2015: estimates by WHO, UNICEF, UNFPA, World Bank Group and the United Nations Population Division. Geneva, Switzerland.

Working group Labour Market Statistics (LAMAS) (2014) Future of the CVTS data collection. Document for item 3.4 of the agenda. Eurostat/F3/LAMAS/31/14.

Zhang, L.-C. (2012) Topics of statistical theory for register-based statistics and data integration. Statistica Neerlandica, vol. 66, pp. 41-63.

Zhang, L.-C. & Hendriks, C. (2012) Micro integration of register-based census data for dwelling and household. Work Session on Statistical Data Editing.