ESSnet on Datawarehousing - the business register Pieter Vlag – Statistics Netherlands

Preview:

Citation preview

2

Outline of the presentation

• DataWareHouse and importance population frame

• relationship population frame - business register

- (default) target population, statistical units

• other crucial datasources: “backbones”

- turnover + employment

• datalinking : the statistical unit base

• conflicting information between datasources

- when correcting in statistical DWH

- when correcting in backbones

- when feedback to business register

ESSnet DWH – business register

3

Definition of a statistical Datawarehouse (according to the FPA)

ESSnet DWH – business register

The broad definition of a data warehouse to be used in this ESSnet is therefore:

‘A common conceptual model for managing all available data of interest, enabling the NSI to (re)use this data to create new data/new outputs, to produce the necessary information and perform reporting and analysis, regardless of the data’s source.’

4

A DataWarehouse: the general idea

ESSnet DWH – business registerAs staging area is “core business” for NSIs, term statistical DWH is used for staging area + WareHouse

5

The statistical DataWarehouse: architecture and layers

ESSnet DWH – business register

6

The statistical DataWareHouse:processing steps the GSBPM model

ESSnet DWH – business register

process

input DWH / int. data

5.1a: link data5.1b: integrate datasee presentation Fursova

Calculate aggregates

Titel van de presentatie 7datasource 1 datasource 2 datasource 3

Output 1 Output 2 Output 3

Linking

Processing (integration layer)

Integrated data

p.analysep.analyse

4

GSBPM-step

5.1

5.2-

5.6

5.7

6

7

8

A datawarehouse without population frame

ESSnet DWH – business register

Datasource I:Admin data

Datasource I:Survey 1

Datasource I:Survey 2

Datasource I:BIG DATA

• different sources cover different enterprises -> information about ?• timing of availability sources differs -> when complete desc. available ?

9

A Datawarehouse with a population frame

ESSnet DWH – business register

Pop

ulat

ion.

Dat

asou

rce

1: a

dmin

dat

a 1

Dat

asou

rce

2: B

IG D

ATA

Dat

asou

rce

3: s

urve

y 1

Dat

asou

rce

4:

surv

ey 2

ADVANTAGE:the coverage of DWH is known (e.g. which enterprises are included in a DWH)

10

Units and target population

The population should be known for the

• datawarehouse; e.g. “about which enterprises info”

• its preparation phase ; e.g. when linking data sources

Challenges are:

• units may differ between the data sources

- decision: which unit used for linking

• what is the reference population- decision: how is the default target population defined

ESSnet DWH – business register

11

Proposals

• Only statistical unit (=enterprise) is used

- for data-linking

- in processing phase of the statistical - DWH

- justification: most obvious, ESSnet on Consistency, maintenance

• Default target population :

all enterprises with economic activity in reference period (e.g. year)

- justification: SBS-regulation

- widest definition of enterprises from which flexible outputs for subpopulations can be derived

- term default is used: as subpopulations do have a target population, tooESSnet DWH – business register

Titel van de presentatie 124

GSBPM-step

5.1

5.2-

5.6

5.7

6

7

Linked data

Integrated data

Processing on stat. unit + default target population only

flexible datasources with different populations and units

Weighting to flexible pop.

flexible output for different populations, and units

13

Population frame and the Business Register

Determination of the default target population in SDWH in 2 steps:

• the population frame, i.e. a list of enterprises with a certain kind of activity during a period.

• confirmation which enterprises of the list really performed economic activities during a period

The business register provides information for the population frame.

Therefore, the statistical Business Register is an indirect datasource for the statistical-DWH

ESSnet DWH – business register

14

Information needed from stat.business register

Recommended information for the population frame :• the frame reference year

• the statistical enterprises unit, including national ID and EGR ID

• the name and address of the enterprise

• the national identification number (ID) of the enterprise

• the date in population (mm/yr)

• the date out of population (mm/yr)

• the NACE-code

• the institutional sector code

• a size class

ESSnet DWH – business register

15

Other backbones

ESSnet AdminData: VAT and social security admin • almost complete for quarter and annual

• can be used for high-quality estimates for turnover + employment respectively.

ESSnet DWH: VAT and social security data are crucial• to confirm the activity status of enterprises

implictly to determine the default target population

• to integrate data suitable for flexible outputs

measurement errors are reduced of sample survey (or data about subpopulation) if weighting to pop.numbers + VAT-turnover + employment

Proposal: to include these admin data as backbones in a stat-DWH

ESSnet DWH – business register

16

Source layer

Int. + Analyses layer

Access layer

Integration layer

SBR

Pop-framedata 1 data 2

VAT empl.

GSBPM 5.1: link & integrate

GSBPM 5.2-5.6: “process”

GSBPM 5.7-5.8: calculate aggregates

Check processing

GSBPM 6: analyse / “DATAWAREHOUSE”

GSBPM 7-9: disseminate

Backbones in a statistical-DWH

Backbones are crucial for data-linking and data-integration;-> need to be checked/cleaned by source in the source layer

17

Source layer

Int. + Analyses layerIntegration layer

SBR

Pop-framedata 1 data 2

VAT

empl.

GSBPM 5.1: link & integrate

GSBPM 5.2-5.6: “process”

GSBPM 5.7-5.8: calculate aggregates

Check processing

GSBPM 6: analyse

GSBPM 7-9: disseminate

Observed: admin data incorporated in BR

When choosing this option, - important part of linking process outside the S-DWH- unless S-DWH integral part of S-DWH (maintenance ?)

18

Determining default target population

ESSnet DWH – business register

If statistical-DWH covers annual statistics only

• relatively straightforward

- derive population frame from business register at the end of reference year t

- determine active or non-active as soon as VAT and/or employment data become available

If STS included in statistical-DWH more complicated:

- updating necessary !

19

Updating population

ESSnet DWH – business registernov aug

jan apr jul okt jan apr jul

FATS survey population year T

FATS frame population of active

units year T

Undercoverage

Overcoverage

In both frame populations

FATS frame population of active

units year T (revised)

Frame error procedure

novRelease 3

julRelease 4

decRelease 1

julRelease 2

20SBR

Pop-frame

data 1 data 2

VAT empl.

GSBPM 5.1: link & integrate

GSBPM 5.2-5.6: “process”

GSBPM 5.7-5.8: calculate aggregates

Check processing

“DATAWAREHOUSE”

The largest enterprises

L.E.

output 1output 2

output 3

If a team within a NSI produces consistent microdata for largest enterprises-> consider this source as backbone

21

Units: ideal situation

ESSnet DWH – business register

• enterprise has a unique ID

• enterprise group has a unique ID

enterprise and enterprise group

• correspond with statistical definitions

• are used in all data sources

In practice more complex situations do exist (especially when using more admin data)

Titel van de presentatie 224

GSBPM-step

5.1

5.2-

5.6

5.7

6

7

Linked data

Integrated data

processing on one unit + one population only

flexible datasources with different population and units

Flexible output for different populations, and units

Key question: how to manage these different in- and output units and their relationships to the statistical unit

ESSnet DWH – business register 23

ENTERPRISE(=statistical unit)

ENTERPRISE GROUP

Legal unit

Legal unit

“Accountìng” unit

“Accountìng” unit

“VAT-unit”

other units

“other tax”units

enterprise

Enterprise

Local unit

LKAU

KAU

Enterprise group

INPUT IN S-DWHprocessing

OUTPUT

24

The unit base

ESSnet DWH – business register

Some remarks:

• Complexity of unit base depends on

- scope of statistical-DWH

- national legislation (practices) with respect to enterprise units

• Unit base closely related to Business Register. Main motivation to place this base outside the Business registers

- more flexible in case of new in- and outputs

- more transparent in case of linking errors

25SBR

Pop-frame

VAT empl.

GSBPM 5.1: link & integrate

GSBPM 5.2-5.6: “process”

GSBPM 5.7-5.8: calculate aggregates

Check processing

“DATAWAREHOUSE”

Position of Business Register in stat -DWH

L.E.

output 1output 2

output 3

survey

units

tax BIG DATA

other

26

Feedback to Business Register

ESSnet DWH – business register

In case of conflicting information between datasources and conclusion is influential error in backbones (and indirectly SBR)

• When incorporating corrections in statistical DWH ?

• When incorporating corrections in backbones ?

• When incorporating corrections in SBR?

27SBR

Pop-frame

survey otherunits

VAT empl.

GSBPM 5.1: link & integrate

GSBPM 5.1-5.6: “process”

GSBPM 5.7-5.8: calculate aggregates

Check processing

L.E.

Correction of information

In SDWH: corrections at 5.6

In backbones themselves: timing most important revisions

In SBR: after end of year (for consistency) – exception major impact

“DATAWAREHOUSE”

output 1output 2

output 3

28

Conclusions

ESSnet DWH – business register

Requirements for statistical-DWH

• Population well defined

• Use of one unit in processing

Backbones desired for

• populations, VAT-turnover, admin data employment, large enterprises

Business Register is indirect input for statistical DWH

• population frame, unit base, survey

Timing of corrections errors (backbone information)

• in DWH: before weighting

• in backbone: when revising

• in Business Register: end of year