UNECE Workshop on Confidentiality

Manchester, 17-19 December 2007

Comparing Fully and Partially Synthetic Data Sets for Statistical Disclosure Control in the German IAB Establishment Panel

Jörg Drechsler, Stefan Bender (Institute for Employment Research, Germany) & Susanne Rässler (University of Bamberg)


Overview

Multiple Imputation for Statistical Disclosure Control

The IAB Establishment Panel

Application of the Two Approaches

Comparison of the Results

Conclusions


Fully synthetic data sets (Rubin 1993)

advantages:
- data are fully synthetic
- re-identification of single units almost impossible
- all variables are still fully available

disadvantages:
- strong dependence on the imputation model
- setting up a model might be difficult/impossible

[Figure: schematic of the data matrix with the blocks X, Y_observed, Y_not observed and Y_synthetic]


Partially synthetic data sets (Little 1993)

only potentially identifying or sensitive variables are replaced


advantages:
- model dependence decreases
- models are easier to set up

disadvantages:
- true values remain in the data set
- disclosure might still be possible

The IAB Establishment Panel


Annually conducted establishment survey

Since 1993 in Western Germany, since 1996 in Eastern Germany

Population: all establishments with at least one employee covered by social security

Source: Official Employment Statistics

Response rate of repeatedly interviewed establishments: more than 80%

Sample of more than 16,000 establishments in the last wave

Contents: employment structure, changes in employment, business policies, investment, training, remuneration, working hours, collective wage agreements, works councils

Application of the Two Approaches

Generating fully synthetic data sets for the IAB Establishment Panel

Create a synthetic data set for selected variables from the 1997 wave of the Establishment Panel

Draw 10 new samples from the Official Employment Statistics using the same sampling design as for the Establishment Panel (stratification by industry, size, and region)

The number of observations in each sample equals the number of observations in the panel: n_s = n_p = 7,332

Every sample is imputed ten times using sequential regression

Number of variables from the Establishment Panel: 48

Imputations are generated using IVEware by Raghunathan, Solenberger and Van Hoewyk (2001)
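The nested design described above (10 new samples, each imputed 10 times) can be sketched as follows. This is a minimal illustration on toy data with a single linear imputation model, not the IVEware sequential regression procedure. The frame, the equal allocation across four strata, and the helper names `draw_sample` and `impute` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sampling frame (assumption): a covariate x known for every unit and a
# stratum label standing in for the industry/size/region cells.
N = 5000
frame_x = rng.normal(size=N)
frame_stratum = rng.integers(0, 4, size=N)

# Toy survey (assumption): y is observed only for the original respondents.
n_resp = 500
resp = rng.choice(N, size=n_resp, replace=False)
y_obs = 2.0 + 1.5 * frame_x[resp] + rng.normal(scale=0.5, size=n_resp)

# Fit the imputation model on the respondents.
X_resp = np.column_stack([np.ones(n_resp), frame_x[resp]])
beta, *_ = np.linalg.lstsq(X_resp, y_obs, rcond=None)
resid = y_obs - X_resp @ beta
sigma = np.sqrt(resid @ resid / (n_resp - X_resp.shape[1]))

def draw_sample(n_per_stratum):
    """Draw a new stratified sample from the frame (equal allocation)."""
    parts = [
        rng.choice(np.flatnonzero(frame_stratum == s), size=n_per_stratum, replace=False)
        for s in range(4)
    ]
    return np.concatenate(parts)

def impute(x_new):
    """Draw synthetic y values for new units.

    Simplification: parameters are fixed at their estimates; a proper
    multiple imputation would also draw them from their posterior.
    """
    return beta[0] + beta[1] * x_new + rng.normal(scale=sigma, size=x_new.size)

synthetic_sets = []
for _ in range(10):                      # 10 new samples from the frame
    units = draw_sample(n_resp // 4)     # same size as the original survey
    for _ in range(10):                  # each sample imputed 10 times
        y_syn = impute(frame_x[units])
        synthetic_sets.append(np.column_stack([frame_x[units], y_syn]))

print(len(synthetic_sets), synthetic_sets[0].shape)
```

With 10 samples and 10 imputations per sample, the sketch produces 100 synthetic data sets, each with as many records as the toy survey.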


Imputation procedure for partially synthetic data

Only two variables are synthesized:
- number of employees
- industry (16 categories)

Same variables for the imputation models

Imputation by sequential regression

Imputation model:
- multinomial logit for the industry
- linear model for the cube root of the number of employees
- 4 independent linear models defined by quartiles for the establishment size

Imputations are based on the authors' own code in R.
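As an illustration of the size model above, the following sketch fits a linear model for the cube root of the number of employees separately within each size quartile and replaces the observed counts by model draws. It is written in Python on toy data and is not the authors' R code. The single predictor `x`, the helper `synthesize_size` and the number of released copies are assumptions, and the multinomial logit for industry is not covered.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (assumption): one predictor and a right-skewed establishment size.
n = 2000
x = rng.normal(size=n)
base = 3.0 + 0.8 * x + rng.normal(scale=0.5, size=n)
employees = np.maximum(1, np.round(base ** 3))

def synthesize_size(x, employees, rng):
    """Replace employee counts by draws from cube-root linear models,
    fitted independently within each quartile of establishment size."""
    z = np.cbrt(employees)
    cuts = np.quantile(employees, [0.25, 0.5, 0.75])
    quartile = np.searchsorted(cuts, employees, side="right")  # values 0..3
    synthetic = np.empty(employees.size)
    for q in range(4):
        m = quartile == q
        X = np.column_stack([np.ones(m.sum()), x[m]])
        beta, *_ = np.linalg.lstsq(X, z[m], rcond=None)
        resid = z[m] - X @ beta
        sigma = np.sqrt(resid @ resid / (m.sum() - X.shape[1]))
        # Simplification: only the residual is drawn, parameters stay fixed.
        z_syn = X @ beta + rng.normal(scale=sigma, size=m.sum())
        # Back-transform and keep at least one employee.
        synthetic[m] = np.maximum(1, np.round(z_syn ** 3))
    return synthetic

# Number of released copies is arbitrary here (assumption).
partially_synthetic = [synthesize_size(x, employees, rng) for _ in range(3)]
print(np.median(employees), [np.median(s) for s in partially_synthetic])
```

All other variables would stay at their observed values, which is what distinguishes this approach from the fully synthetic one.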

Comparison of the Results

Analytical validity

Compare regression results from the original data with results from the synthetic data

First regression: Zwick (2005) analyses the productivity effects of different forms of continuing vocational training in Germany
- Probit regression to explain why firms offer vocational training
- 13 explanatory variables, including: share of qualified employees, establishment size, industry, collective wage agreement, high qualification needs expected, …

Second regression: log(number of employees) on 15 industry dummies

Two data utility measures:
- comparison of the beta coefficients from the original data set and the synthetic data sets
- confidence interval overlap
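The coefficient comparison for the second regression could look like the sketch below, which fits log(number of employees) on an intercept and 15 industry dummies for an original and a synthetic data set. The data are toy values, the "synthetic" sizes are simply perturbed originals standing in for a released data set, and `fit_industry_regression` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_industry_regression(employees, industry, n_industries=16):
    """OLS of log(employees) on an intercept and 15 industry dummies
    (industry 0 is the reference category)."""
    X = np.zeros((employees.size, n_industries))
    X[:, 0] = 1.0
    for j in range(1, n_industries):
        X[:, j] = industry == j
    beta, *_ = np.linalg.lstsq(X, np.log(employees), rcond=None)
    return beta

# Toy original data (assumption).
n = 3000
industry = rng.integers(0, 16, size=n)
employees = np.exp(2.0 + 0.1 * industry + rng.normal(scale=1.0, size=n))

# Stand-in for one synthetic data set (assumption): perturbed sizes.
employees_syn = employees * np.exp(rng.normal(scale=0.2, size=n))

beta_orig = fit_industry_regression(employees, industry)
beta_syn = fit_industry_regression(employees_syn, industry)

for j, (b0, b1) in enumerate(zip(beta_orig, beta_syn)):
    name = "intercept" if j == 0 else f"industry dummy {j}"
    print(f"{name:18s} original {b0: .3f}  synthetic {b1: .3f}")
```

With several synthetic data sets, the point estimate to compare against the original would be the average of the coefficients across the data sets.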


Confidence interval overlap

Suggested by Karr et al. (2006)

Measure the overlap of CIs from the original data and CIs from the synthetic data

The higher the overlap, the higher the data utility

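Compute the average relative CI overlap for any estimate k:

$$J_k = \frac{1}{2}\left[\frac{U_{over,k} - L_{over,k}}{U_{orig,k} - L_{orig,k}} + \frac{U_{over,k} - L_{over,k}}{U_{syn,k} - L_{syn,k}}\right]$$

where [L_orig,k, U_orig,k] is the CI for the original data, [L_syn,k, U_syn,k] is the CI for the synthetic data, and [L_over,k, U_over,k] is their overlap

[Figure: the CI for the original data (L_orig to U_orig) and the CI for the synthetic data (L_syn to U_syn), with their overlap (L_over to U_over)]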
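A small sketch of the overlap measure: the length of the intersection of the two intervals is divided by the length of each interval, and the two ratios are averaged. The function name `ci_overlap` and the convention of scoring disjoint intervals as 0 are assumptions made here, not taken from the slides.

```python
def ci_overlap(l_orig, u_orig, l_syn, u_syn):
    """Average relative overlap of an original and a synthetic confidence interval."""
    l_over = max(l_orig, l_syn)   # lower bound of the intersection
    u_over = min(u_orig, u_syn)   # upper bound of the intersection
    width = u_over - l_over
    if width <= 0:
        return 0.0                # assumed convention for disjoint intervals
    return 0.5 * (width / (u_orig - l_orig) + width / (u_syn - l_syn))

print(ci_overlap(0.0, 1.0, 0.0, 1.0))  # identical intervals: 1.0
print(ci_overlap(0.0, 1.0, 0.5, 1.5))  # half of each interval shared: 0.5
print(ci_overlap(0.0, 2.0, 0.5, 1.5))  # synthetic CI nested in original: 0.75
```

Averaging the measure over all coefficients of a regression gives a single utility score per data set, which is how an "average overlap" per regression can be reported.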


[Table: Results from the first regression (Zwick 2005). Legend: significant at the 0.1% level, significant at the 1% level, significant at the 5% level]


[Figure: Average confidence interval (CI) overlap for the estimates from the first regression. Average overlap: 0.808 (fully synthetic data), 0.926 (partially synthetic data)]


[Table: Results from the second regression (log(number of employees) on industry). Legend: significant at the 0.1% level, significant at the 1% level, significant at the 5% level, insignificant]

[Figure: Average confidence interval (CI) overlap for the estimates from the second regression, shown for industry dummies 1 to 15 and the intercept on a scale from 0.0 to 1.0. Average overlap: 0.699 (fully synthetic data), 0.839 (partially synthetic data)]


Disclosure risk

Difficult to compare between partially and fully synthetic data sets

Disclosure risk is low for fully synthetic data sets, although not zero

Disclosure risk is higher for partially synthetic data sets because:

• True values remain in the data set

• Only survey respondents are included

For partially synthetic data sets a careful disclosure risk evaluation is necessary


Conclusions

Generating synthetic data sets can be a useful method for statistical disclosure control (SDC)

Advantages for partially synthetic data sets:

• Higher data validity
• Imputation models easier to set up
• Lower risk of biased imputations

Disadvantages for partially synthetic data sets:

• Higher risk of disclosure
• Careful disclosure risk evaluation necessary

Agencies will have to choose between the two approaches depending on the complexity of the survey and the expected risk of disclosure


Thank you for your attention
