All the answers? Statistics New Zealand’s Integrated Data Infrastructure

Paper by Felibel Zabala, Rodney Jer, Jamas Enright and Allyson Seyb

Presented by Felibel Zabala

Sept 2012

Statistics New Zealand’s Integrated Data Infrastructure (IDI)

Merges data from different suppliers including Statistics NZ

Variable quality of the different datasets, both within and between

Linking clean datasets is not easy; linking datasets of variable quality is much more difficult

Importance of an effective and efficient editing strategy

Main objective

Present some of the issues in, and solutions for, any linked administrative dataset, with a focus on one of Statistics NZ’s first integrated datasets, the Linked Employer-Employee Data (LEED)

LEED

Provides the backbone of the IDI prototype

Links longitudinal business data from Statistics NZ’s Business Frame to a longitudinal series of payroll tax data from Inland Revenue (IRD)

Used to produce quarterly statistics that measure labour market dynamics at various levels, eg filled jobs, worker flows, and total earnings

LEED Payroll data

Collected from employers for New Zealand’s taxation system through IRD’s Employer Monthly Schedule (EMS)

Information available from EMS:

Employer/employee name and IRD number

Taxable earnings for work performed, taxed at source

Income tax deductions (pay-as-you-earn or PAYE, withholding tax, child support payment, student loan indicator amount)

Start and finish dates of employment

LEED – additional details

Also includes payments made to beneficiaries by the government

Contains a subset of the self-employed

LEED – additional details (cont’d)

Collection unit – the legal entity that files the EMS return

Statistical unit – the ‘employer’ in LEED, which is the geographical or physical location of the business

Methods of integration in LEED

Figure 1. Unit record links in LEED

Linking employer to enterprise

Linking employer longitudinally

Linking enterprise and geo longitudinally

Linking employee longitudinally

Variables edited in LEED

IRD numbers

Gross earnings

Date of birth

Sex

Workplace of an employee

Start and end dates of employment

Editing strategy: Do not replace any IRD data unless there is strong evidence it is an error

Variables edited in LEED (cont’d)

IRD numbers

Imputation of sex

Imputation of start and end dates of employment

Variables edited in LEED (cont’d)

Gross earnings
Presence of systematic errors
Detection method – use of ratio edit: PAYE/gross earnings
Imputation method

Date of birth
Presence of systematic errors
Detection method – edit rules based on an employee’s age against some events
Imputation method
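A ratio edit of the kind named for gross earnings can be sketched as follows. This is a minimal illustration only, not the LEED implementation: the record layout, the function name `ratio_edit`, and the bounds `low` and `high` are all assumptions made for the example, and real bounds would have to be set from the tax schedule and the observed data.

```python
def ratio_edit(records, low=0.0, high=0.5):
    """Return the ids of records that fail a PAYE/gross earnings ratio edit.

    The bounds are hypothetical placeholders, not values used in LEED.
    """
    flagged = []
    for rec in records:
        gross = rec["gross"]
        if gross <= 0:
            # Ratio is undefined; send the record for review.
            flagged.append(rec["id"])
            continue
        ratio = rec["paye"] / gross
        if ratio < low or ratio > high:
            # PAYE is implausible relative to gross earnings,
            # e.g. gross earnings keyed in the wrong units.
            flagged.append(rec["id"])
    return flagged


# Made-up records for illustration.
sample = [
    {"id": 1, "gross": 1000.0, "paye": 180.0},  # ratio 0.18, passes
    {"id": 2, "gross": 10.0, "paye": 180.0},    # ratio 18, fails
    {"id": 3, "gross": 0.0, "paye": 50.0},      # undefined ratio, fails
]
failed = ratio_edit(sample)
```

Flagged records would then go to whatever imputation method is chosen; the edit only detects candidates and does not decide which of the two values is wrong.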

Variables edited in LEED (cont’d)

Imputation of workplace of an employee

Uses transportation method, where the imputed workplace of an employee is the geo that minimises the distance between the employee’s home address and the geo, subject to the constraints that

each employee is assigned to a geo, and

the total number of employees allocated to a geo should equal the number of employees expected from the geo
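The allocation described above can be illustrated on toy data. The sketch below assumes the objective is the total home-to-geo distance across all employees and solves it by brute-force enumeration, which is exact but only feasible for a handful of records; a production system would use a linear programming or transportation solver instead. The function name `allocate`, the coordinate representation, and the data are hypothetical.

```python
from itertools import permutations
from math import hypot


def allocate(employees, geos, expected):
    """Assign each employee to one geo, minimising total distance.

    employees: {employee_id: (x, y)} home coordinates
    geos:      {geo_id: (x, y)} workplace coordinates
    expected:  {geo_id: number of employees expected at that geo}
    """
    if sum(expected.values()) != len(employees):
        raise ValueError("expected counts must sum to the number of employees")

    # One slot per expected employee, so every geo receives exactly its count.
    slots = [geo for geo, n in expected.items() for _ in range(n)]
    emp_ids = list(employees)

    def dist(emp, geo):
        (x1, y1), (x2, y2) = employees[emp], geos[geo]
        return hypot(x1 - x2, y1 - y2)

    best_cost, best = float("inf"), None
    for perm in permutations(slots):
        cost = sum(dist(e, g) for e, g in zip(emp_ids, perm))
        if cost < best_cost:
            best_cost, best = cost, dict(zip(emp_ids, perm))
    return best, best_cost


# Made-up coordinates on a line, for illustration only.
homes = {"a": (0, 0), "b": (1, 0), "c": (9, 0), "d": (10, 0)}
sites = {"north": (0, 0), "south": (10, 0)}
allocation, total = allocate(homes, sites, {"north": 2, "south": 2})
```

When the expected counts are uneven, the second constraint can push an employee away from the nearest geo: with one employee expected at `north` and three at `south`, employee `b` is assigned to `south` even though `north` is closer.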

The IDI prototype

Datasets linked to LEED

Benefit data

Tertiary education data

Administrative tertiary education data and student loans and allowances data

Statistics NZ’s Household Labour Force Survey (HLFS) and its supplementary surveys

The IDI prototype (cont’d)

Other linked dataset in the IDI

The Longitudinal Business Database (LBD) prototype includes information on business demographics, financial data, employment, goods exports, government assistance, and management practices

The IDI prototype (cont’d)

Figure 2. Linking in the IDI prototype

Issues in linking in the IDI

Lack of a common identifier across datasets

Main variables in the Central Linking Concordance (CLC): IRD numbers, passport numbers, and student IDs, where available

Use of demographic variables as partial identifiers
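One way to picture linking without a common identifier is a rule cascade: try each unique identifier in turn, then fall back to an exact match on demographic variables. The sketch below is an assumption about how such a cascade could look, not a description of the CLC linkage itself; the field names, the order of the rules, and the function `link` are hypothetical, and real linkage would typically allow for partial agreement rather than exact matches only.

```python
def link(record, spine):
    """Return (spine_id, basis) for the first rule giving a unique match.

    Returns (None, None) when no rule yields exactly one candidate.
    """
    # Rule 1 to 3: unique identifiers, where available.
    for key in ("ird_number", "passport_number", "student_id"):
        value = record.get(key)
        if value:
            matches = [s for s in spine if s.get(key) == value]
            if len(matches) == 1:
                return matches[0]["id"], key

    # Rule 4: demographic variables used together as a partial identifier.
    partial = ("name", "date_of_birth", "sex")
    if all(record.get(k) for k in partial):
        matches = [s for s in spine
                   if all(s.get(k) == record[k] for k in partial)]
        if len(matches) == 1:
            return matches[0]["id"], "demographic"

    return None, None


# Made-up spine records; the identifiers are not real numbers.
spine = [
    {"id": "P1", "ird_number": "111", "name": "ana smith",
     "date_of_birth": "1980-07-03", "sex": "F"},
    {"id": "P2", "ird_number": "222", "student_id": "S9", "name": "ben jones",
     "date_of_birth": "1975-01-20", "sex": "M"},
]
by_id = link({"student_id": "S9"}, spine)
by_demographics = link(
    {"name": "ana smith", "date_of_birth": "1980-07-03", "sex": "F"}, spine
)
```

Returning the basis of each link alongside the identifier makes it possible to report how many links rest on demographic variables alone, which are the ones most exposed to error.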

Issues in linking in the IDI (cont’d)

Need for standard software for automated data linkage that is robust to data changes

Timing of receipt of data

Editing strategy in the IDI

Focus on ensuring high-quality linking variables are used in linking. Examples:

Validity rules were used to edit names across data sources

Sex and date of birth are reformatted to ensure common coding is used across data sources

Where inconsistencies occur in records linked from two different data sources, it is important to know which of the two data sources is more reliable
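Reformatting sex and date of birth to a common coding, as mentioned in the examples above, might look like the sketch below. The source codings in `SEX_CODES` and the date layouts in `DATE_FORMATS` are invented for illustration; the actual codings used by each supplier are not given here.

```python
from datetime import datetime

# Hypothetical source codings mapped to one common code.
SEX_CODES = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "2": "F",
}

# Hypothetical date layouts, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def standardise_sex(value):
    """Return 'M' or 'F', or None when the code is not recognised."""
    return SEX_CODES.get(str(value).strip().lower())


def standardise_dob(value):
    """Return the date as YYYY-MM-DD, or None when it cannot be parsed."""
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Unrecognised values come back as `None` rather than being guessed, so they can be treated as missing in later editing steps.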

Editing strategy in the IDI (cont’d)

Process to resolve inconsistencies in personal details

Most common value present in the datasets should be kept

Prioritise the data sources to determine the order of retaining their values
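The two rules above can be combined in more than one way. The sketch below assumes that the most common value wins and that the source priority is used only to break ties; that combination, the function name `resolve`, and the source labels are assumptions for illustration.

```python
from collections import Counter


def resolve(values_by_source, priority):
    """Pick one value for a personal detail reported by several sources.

    values_by_source: {source_name: value or None}
    priority:         source names, most trusted first
    """
    observed = {s: v for s, v in values_by_source.items() if v is not None}
    if not observed:
        return None

    counts = Counter(observed.values())
    top = max(counts.values())
    candidates = {v for v, n in counts.items() if n == top}

    # Rule 1: keep the most common value when it is unique.
    if len(candidates) == 1:
        return candidates.pop()

    # Rule 2: on a tie, keep the value from the highest-priority source.
    for source in priority:
        if observed.get(source) in candidates:
            return observed[source]
    return None


# Made-up values from three hypothetical sources.
dob = resolve(
    {"source_a": "1980-07-03", "source_b": "1980-07-03", "source_c": "1980-03-07"},
    priority=["source_c", "source_a", "source_b"],
)
sex = resolve({"source_a": "F", "source_c": "M"},
              priority=["source_c", "source_a"])
```

In the first call the majority value is kept even though the dissenting source ranks highest; in the second the two sources disagree one against one, so the higher-priority source decides.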

Editing strategy in the IDI (cont’d)

Editing strategy should be able to

Edit inconsistencies for the same unit across different sources

Treat erroneous and missing variables in a record

Ensure consistency in variables across a record for a time period and over time

Next steps

Build of the IDI with a focus on improving the linking methodology

Determine standard quality measures for outputs produced using administrative data

Next steps (cont’d)

Redevelopment of LEED and SLA systems

Investigate the use of geospatial information to improve the employee allocation method

Review of the editing of gross earnings

Investigate the use of Banff
