Multiple Imputation Approaches for Right-Censored Wages in the German IAB Employment Register

European Conference on Quality in Official Statistics 2008, 10 July 2008

Thomas Büttner, Institute for Employment Research (IAB)
Susanne Rässler, University of Bamberg



Motivation

Wage data are of interest for a large number of research questions:
- Analyzing the gender wage gap
- Measuring overeducation
- …

To address questions of this kind, two types of data are often used:
- Survey data
- Administrative data from the social security system

Advantages of administrative data:
- Large number of observations
- No response burden
- No interviewer bias

The German IAB Employment Sample (1)

Administrative data:
- Sample drawn from the IAB register data (employment history), supplemented by information on benefit recipients
- Represents 80 percent of the employees in Germany
- 2 percent random sample of all employees covered by social security
- 1.3 million persons

Problem: Wages can only be recorded up to the contribution limit of the social security system

The wage information is censored at this limit

The German IAB Employment Sample (2)

[Figure: histogram of log daily wages; x-axis: lntentgelt (3 to 5), y-axis: Frequency (0 to 3.0e+04)]

Daily wages in logs in Western Germany (2000)

Source: IAB Employment Sample

Censored Wages

Several possibilities to deal with censored wages

Advantage of multiple imputation: the imputed data set can be used for a multiplicity of questions and analyses, e.g. average wages of certain groups, analyzing regional wage dispersion, effects of a modification of the contribution limit, …

The conventional approaches assume homoscedasticity of the residuals

Since the dispersion of income is generally smaller in lower wage categories than in higher ones, the assumption of homoscedasticity is highly questionable for wage data

Our Project

Step 1: Developing approaches considering heteroscedasticity

Step 2: Simulation study to confirm the necessity and validity of the new approaches

Step 3: Using uncensored wage information from an income survey (German Structure of Earnings Survey, GSES) to validate the approaches

Step 4: Using external wage information for the imputation model

Imputation Models

Single Imputation based on a homoscedastic tobit model

Single Imputation using a heteroscedastic model

Multiple Imputation based on a homoscedastic tobit model

Multiple Imputation considering heteroscedasticity


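Single Imputation

Single imputation based on a homoscedastic tobit model:

$$y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$$

$$y_i = \begin{cases} y_i^* & \text{if } y_i^* \le a \\ a & \text{if } y_i^* > a \end{cases}$$

where a is the contribution limit

Imputation by draws of random values according to the parameters estimated using a tobit model

As the true values are above the contribution limit, the draws are taken from a truncated normal distribution:

$$y_i^* \sim N_{trunc_a}(x_i'\hat\beta, \hat\sigma^2)$$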
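A draw from a normal distribution truncated below at the contribution limit a can be sketched in Python as follows. This is only an illustration using inverse-CDF sampling with the standard library, not the authors' implementation. The function name `draw_truncated_normal` and the numbers in the usage lines (a fitted mean of 4.0, a standard deviation of 0.5 and a limit of 4.5 on the log-wage scale) are assumptions chosen for the example.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf and inverse cdf


def draw_truncated_normal(mean, sd, lower, rng=random):
    """Draw one value from N(mean, sd^2) restricted to values above `lower`."""
    # Probability mass of the untruncated normal below the limit.
    p_low = STD_NORMAL.cdf((lower - mean) / sd)
    # Uniform draw on (p_low, 1), kept strictly inside (0, 1) for inv_cdf.
    u = rng.uniform(p_low, 1.0)
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    # Map back to the wage scale; max() guards against rounding at the limit.
    return max(lower, mean + sd * STD_NORMAL.inv_cdf(u))


# Hypothetical example: impute three censored log wages above the limit a = 4.5.
rng = random.Random(42)
imputed = [draw_truncated_normal(4.0, 0.5, 4.5, rng) for _ in range(3)]
```

In an actual application the mean would be the fitted value x_i'β̂ and the standard deviation σ̂ from the tobit estimation.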


Single Imputation Considering Heteroscedasticity

Development of an imputation approach considering heteroscedasticity (single imputation), based on a GLS model for truncated variables

Imputation by draws from a truncated normal distribution using individual variances $\hat\sigma_i^2$:

$$y_i^* \sim N_{trunc_a}(x_i'\hat\beta, \hat\sigma_i^2)$$

• Single imputation may lead to biased variance estimates (Little/Rubin 1987)

Multiple Imputation (1)

1. Impute the data set m times
2. Analyze each data set
3. Combine the results

Multiple Imputation (2)

1. To be able to start the imputation based on MCMC, we first need to adopt starting values for the parameters from an ML tobit estimation

2. In the imputation step, we randomly draw values for the missing wages from a truncated distribution:

$$y_i^{*(t)} \sim N_{trunc_a}(x_i'\beta^{(t)}, \sigma^{2(t)})$$

3. Based on the imputed data set, we compute an OLS regression

4. After this, we produce random draws for the parameters according to their complete data posterior distribution:

$$\sigma^{2(t+1)} \sim \frac{RSS}{\chi^2_{(n-k)}}$$

$$\beta^{(t+1)} \sim N\left(\hat\beta^{(t)}, \sigma^{2(t+1)} (X'X)^{-1}\right)$$

5. We repeat the imputation step and the posterior step 5,000 times and use $(y_i^{*(1000)}, y_i^{*(2000)}, \ldots, y_i^{*(5000)})$ to obtain 5 complete data sets
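The alternation between the imputation step and the posterior step can be sketched in Python. This is a simplified illustration under stated assumptions, not the authors' code: it uses a single covariate plus an intercept (so k = 2), centres the covariate so that the intercept and slope can be drawn independently, and starts from a naive OLS fit on the censored data instead of an ML tobit estimation. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def draw_truncated_normal(mean, sd, lower, rng):
    """One draw from N(mean, sd^2) restricted to values above `lower`."""
    p_low = STD_NORMAL.cdf((lower - mean) / sd)
    u = min(max(rng.uniform(p_low, 1.0), 1e-12), 1.0 - 1e-12)
    return max(lower, mean + sd * STD_NORMAL.inv_cdf(u))


def fit_ols(xc, y):
    """OLS of y on an intercept and a centred covariate xc; returns (c, b, rss)."""
    c = sum(y) / len(y)                              # intercept at the mean of x
    sxx = sum(v * v for v in xc)
    b = sum(v * w for v, w in zip(xc, y)) / sxx      # slope
    rss = sum((w - c - b * v) ** 2 for v, w in zip(xc, y))
    return c, b, rss


def impute_censored(x, y, a, n_iter=5000, keep_every=1000, seed=1):
    """Return n_iter // keep_every completed copies of y (values at a are censored)."""
    rng = random.Random(seed)
    n = len(y)
    x_bar = sum(x) / n
    xc = [v - x_bar for v in x]
    sxx = sum(v * v for v in xc)
    censored = [i for i, w in enumerate(y) if w >= a]
    # Simplification: start from naive OLS on the censored data, not ML tobit.
    c, b, rss = fit_ols(xc, y)
    sigma2 = rss / (n - 2)
    completed, y_imp = [], list(y)
    for t in range(1, n_iter + 1):
        # Imputation step: draw the censored wages above the limit a.
        sd = math.sqrt(sigma2)
        for i in censored:
            y_imp[i] = draw_truncated_normal(c + b * xc[i], sd, a, rng)
        # OLS regression on the completed data set.
        c_hat, b_hat, rss = fit_ols(xc, y_imp)
        # Posterior step: sigma^2 ~ RSS / chi2(n - k), then the coefficients.
        sigma2 = rss / rng.gammavariate((n - 2) / 2.0, 2.0)
        c = rng.gauss(c_hat, math.sqrt(sigma2 / n))
        b = rng.gauss(b_hat, math.sqrt(sigma2 / sxx))
        if t % keep_every == 0:
            completed.append(list(y_imp))
    return completed
```

With the default arguments the function keeps the imputations from iterations 1000, 2000, ..., 5000, which gives 5 completed data sets. Because the covariate is centred, X'X is diagonal, so drawing the two coefficients independently is equivalent to one draw from the bivariate normal posterior.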


Imputation Model Considering Heteroscedasticity (1)

Based on the multiple imputation approach, with additional draws for the parameters describing the functional form of the heteroscedasticity

1. We now start the imputation by adopting starting values from a GLS estimation

2. Then we are able to draw values for the missing wages from a truncated distribution using individual variances:

$$y_i^{*(t)} \sim N_{trunc_a}(x_i'\beta^{(t)}, \sigma_i^{2(t)})$$

3. Then a GLS regression is computed based on the imputed data set

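Imputation Model Considering Heteroscedasticity (2)

4. Afterwards we perform random draws for $\sigma^2$ and $\alpha$, the parameters describing the heteroscedasticity:

$$\sigma^{2(t+1)} \sim \frac{RSS}{\chi^2_{(n-k)}}, \qquad RSS = \sum_i \frac{(y_i - x_i'\hat\beta^{(t)})^2}{e^{z_i'\hat\alpha^{(t)}}}$$

with $\sigma_i^2 = \exp(\ln\hat\sigma^2 + z_i'\hat\alpha)$

$$\alpha^{(t+1)} \sim N\left(\hat\alpha^{(t)}, \hat V(\hat\alpha^{(t)})\right)$$

5. Now the parameters $\beta$ can be drawn randomly according to their complete data posterior distribution:

$$\beta^{(t+1)} \sim N\left(\hat\beta^{(t)}, \sigma^{2(t+1)} \left(X' \, \mathrm{diag}\left(e^{-z_i'\alpha^{(t+1)}}\right) X\right)^{-1}\right)$$

6. Steps 2 to 5 are again repeated 5,000 times and we use $(y_i^{*(1000)}, y_i^{*(2000)}, \ldots, y_i^{*(5000)})$ to obtain 5 complete data sets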
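How individual variances enter the imputation draw can be illustrated with a short Python sketch. It assumes a multiplicative form in which the log variance is linear in covariates z_i, that is, sigma_i^2 = exp(ln sigma^2 + z_i'alpha). The symbol alpha, the helper names and all numeric values are assumptions made for the example, not taken from the authors' code.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def individual_sd(ln_sigma2, alpha, z_i):
    """Standard deviation under sigma_i^2 = exp(ln sigma^2 + z_i' alpha)."""
    log_var = ln_sigma2 + sum(a_j * z_j for a_j, z_j in zip(alpha, z_i))
    return math.exp(0.5 * log_var)


def draw_truncated_normal(mean, sd, lower, rng):
    """One draw from N(mean, sd^2) restricted to values above `lower`."""
    p_low = STD_NORMAL.cdf((lower - mean) / sd)
    u = min(max(rng.uniform(p_low, 1.0), 1e-12), 1.0 - 1e-12)
    return max(lower, mean + sd * STD_NORMAL.inv_cdf(u))


# Hypothetical values: two censored persons with the same fitted mean but
# different z_i, so the second one gets a larger individual variance.
rng = random.Random(3)
alpha = [0.4]
fitted = [(4.2, [0.0]), (4.2, [1.0])]          # pairs (x_i' beta, z_i)
sds = [individual_sd(math.log(0.09), alpha, z) for _, z in fitted]
draws = [draw_truncated_normal(mu, sd, 4.5, rng)
         for (mu, _), sd in zip(fitted, sds)]
```

The only change relative to the homoscedastic draw is that each person's own standard deviation is passed to the truncated normal draw.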


Simulation Study

IAB Employment Sample 2000 (30 June 2000)

Only male persons from Western Germany

Only full-time workers covered by social security

About 210,000 persons, about 23,000 (11 percent) of them with an income above the contribution limit

Creating Complete Data Sets

As the IAB Employment Sample is censored, we first have to create complete data sets

We create two different data sets:

- one data set using an approach presuming homoscedasticity:

$$y_{new} \sim N(x'\hat\beta, \hat\sigma^2)$$

- another data set using an approach considering heteroscedasticity of the residuals:

$$y_{new} \sim N(x'\hat\beta, \hat\sigma_i^2)$$

Simulation Study

1. IABS with censored wages

2. Creating complete data sets (with and without heteroscedasticity), calculating β

3. Defining a new limit

4. Drawing a random sample of 10 percent

5. Imputing the wages using the different approaches, computing a regression

6. Calculating, for each approach, the fraction of confidence intervals containing the true parameter β
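The coverage calculation in the last step can be sketched in Python. The sketch assumes nominal 95 percent intervals of the form estimate ± 1.96 × standard error, which the slides do not state explicitly, and it replaces the wage regression by a toy example (estimating a mean) so that it stays self-contained. The function name `coverage` and the toy settings are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def coverage(results, true_value, z=1.96):
    """Fraction of intervals estimate +/- z * se that contain true_value."""
    hits = sum(1 for est, se in results
               if est - z * se <= true_value <= est + z * se)
    return hits / len(results)


# Toy replication loop: 1000 samples of size 50, true mean 0.5.
rng = random.Random(2008)
true_mean = 0.5
results = []
for _ in range(1000):
    sample = [rng.gauss(true_mean, 1.0) for _ in range(50)]
    results.append((mean(sample), stdev(sample) / math.sqrt(len(sample))))
cov = coverage(results, true_mean)
```

For a valid procedure the coverage should be close to the nominal level. Values well below it indicate biased estimates or understated variances.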


Results of the Homoscedastic Data Set

HOM                    complete data                SI            SI-Het                MI            MI-Het
                 β estimate coverage estimate coverage estimate coverage estimate coverage estimate coverage
educ1       0.1068   0.1069    0.959   0.1074    0.951   0.1073     0.95   0.1074    0.958   0.1073    0.958
educ2       0.1791   0.1790    0.965   0.1792    0.953   0.1790    0.952   0.1792    0.965   0.1790    0.961
educ3       0.1305   0.1310    0.954   0.1317    0.939   0.1330    0.935   0.1318    0.955   0.1330    0.957
educ4       0.2621   0.2623    0.963   0.2624    0.928   0.2654    0.888   0.2624    0.957   0.2653    0.949
educ5       0.4445   0.4446    0.948   0.4409    0.868   0.4466    0.759   0.4410    0.944   0.4469    0.922
educ6       0.5098   0.5096    0.962   0.5064    0.852   0.5121    0.719   0.5065    0.953   0.5118    0.929
level1      0.5449   0.5441    0.949   0.5440    0.952   0.5447     0.95   0.5440    0.949   0.5446     0.95
level2      0.6517   0.6512     0.95   0.6515    0.954   0.6524    0.951   0.6515    0.952   0.6523    0.951
level3      0.8958   0.8950    0.948   0.8973     0.95   0.8958    0.936   0.8976    0.948   0.8959    0.954
level4      0.8962   0.8956    0.953   0.8961     0.95   0.8962    0.949   0.8962    0.951   0.8963    0.951
age         0.0498   0.0498    0.955   0.0500    0.943   0.0500     0.93   0.0500    0.964   0.0500    0.957
sqage      -0.0005  -0.0005    0.958  -0.0005    0.936  -0.0005    0.922  -0.0005    0.962  -0.0005     0.96
nation     -0.0329  -0.0327    0.962  -0.0334    0.948  -0.0334    0.942  -0.0335    0.953  -0.0334    0.955
cons        2.4424   2.4433    0.953   2.4406    0.945   2.4405    0.932   2.4411    0.951   2.4406    0.949

Results of the Heteroscedastic Data Set

HET                    complete data                SI            SI-Het                MI            MI-Het
                 β estimate coverage estimate coverage estimate coverage estimate coverage estimate coverage
educ1       0.1141   0.1145    0.952   0.1271    0.794   0.1136    0.945   0.1272    0.804   0.1136    0.955
educ2       0.1912   0.1915    0.955   0.2075    0.616   0.1903    0.948   0.2076    0.632   0.1903    0.955
educ3       0.1442   0.1444    0.961   0.0947    0.745   0.1406    0.942   0.0952    0.769   0.1420    0.963
educ4       0.2685   0.2686    0.961   0.2753    0.913   0.2688    0.922   0.2754    0.937   0.2689     0.96
educ5       0.4433   0.4435    0.963   0.4790    0.366   0.4372    0.761   0.4796    0.478   0.4377    0.917
educ6       0.5241   0.5248    0.954   0.5117    0.785   0.5164    0.718   0.5121    0.869   0.5161    0.896
level1      0.5422   0.5426    0.955   0.5415    0.946   0.5422    0.947   0.5416    0.946   0.5417    0.953
level2      0.6405   0.6411     0.95   0.6430    0.944   0.6412    0.944   0.6430    0.947   0.6407     0.95
level3      0.8856   0.8864    0.945   0.8780    0.941   0.8845    0.945   0.8782    0.948   0.8838    0.952
level4      0.8903   0.8908    0.952   0.8737    0.941   0.8919    0.943   0.8737    0.941   0.8913    0.951
age         0.0432   0.0431    0.955   0.0457    0.645   0.0431    0.948   0.0457    0.679   0.0431     0.97
sqage      -0.0004  -0.0004     0.96  -0.0005     0.59  -0.0004    0.941  -0.0005    0.623  -0.0004    0.968
nation     -0.0223  -0.0218    0.961  -0.0297    0.872  -0.0222    0.945  -0.0296    0.882  -0.0222    0.954
cons        2.5858   2.5865    0.947   2.5318    0.909   2.5868    0.945   2.5315    0.914   2.5875    0.952

Simulation study using external wage information (1)

Scientific-Use-File of the German Structure of Earnings Survey (GSES) 2001

Linked Employer-Employee data set

Information on about 22,000 establishments and about 846,000 employees

Information on

- individuals (e.g. sex, age, education)

- jobs (e.g. occupation, job level, working times)

- income (e.g. gross wage, net wage, income taxes)

- and establishments


Simulation study using external wage information (2)

Selection of a sample comparable to the first simulation study

Complete data set containing 382,710 persons

Censoring at the 85 percent quantile


Simulation study using external wage information (3)

                       complete data                MI            MI-Het
                 β estimate coverage estimate coverage estimate coverage
educ2       0.0471   0.0472    0.946   0.0476    0.951   0.0474    0.952
educ3       0.0933   0.0929    0.929   0.0709    0.841   0.0783    0.897
educ4       0.1067   0.1069    0.907   0.0863    0.559   0.0894    0.674
educ5       0.2095   0.2100    0.934   0.2086    0.963   0.2164    0.930
educ6       0.2822   0.2826    0.906   0.2501    0.181   0.2685    0.790
level3      0.0183   0.0181    0.949   0.0165    0.946   0.0164    0.944
level4      0.0862   0.0833    0.951   0.0828    0.967   0.0848    0.964
level5      0.0686   0.0652    0.951   0.0390    0.880   0.0359    0.856
group2     -0.1378  -0.1377    0.956  -0.1338    0.912  -0.1336    0.909
group3     -0.2691  -0.2691    0.955  -0.2634    0.874  -0.2631    0.870
group4     -0.4151  -0.4150    0.951  -0.4108    0.927  -0.4104    0.920
group5      0.6083   0.6117    0.942   0.5689    0.788   0.5797    0.869
group6      0.1925   0.1956    0.951   0.2091    0.939   0.2118    0.928
group7      0.0449   0.0482    0.950   0.0741    0.879   0.0763    0.862
group8     -0.2738  -0.2701    0.952  -0.2421    0.863  -0.2393    0.843
group9     -0.4865  -0.4840    0.947  -0.4578    0.901  -0.4554    0.895
age         0.0332   0.0332    0.937   0.0336    0.953   0.0334    0.958
sqage      -0.0003  -0.0003    0.934  -0.0003    0.857  -0.0003    0.918
region2     0.0491   0.0492    0.933   0.0651    0.123   0.0605    0.414
region3     0.0060   0.0060    0.929   0.0147    0.623   0.0108    0.859
region4     0.0732   0.0731    0.937   0.0711    0.931   0.0672    0.753
contract   -0.1659  -0.1660    0.959  -0.1624    0.947  -0.1619    0.943
cons        3.8467   3.8461    0.939   3.8525    0.953   3.8572    0.952

References

Bender, S., Haas, A. and Klose, C. (2000). IAB Employment Subsample 1975–1995. Opportunities for Analysis Provided by Anonymised Subsample. IZA Discussion Paper 117, IZA Bonn.

Buchinsky, M. (1994). Changes in the U.S. wage structure 1963–1987: Application of quantile regression. Econometrica 62(2), 405–458.

Gartner, H. (2005). The imputation of wages above the contribution limit with the German IAB employment sample. FDZ Methodenreport 2/2005.

Gartner, H. and Rässler, S. (2005). Analyzing the changing gender wage gap based on multiply imputed right censored wages. IAB Discussion Paper 05/2005.

Jensen, U., Gartner, H. and Rässler, S. (2006). Measuring overeducation with earnings frontiers and multiply imputed censored income data. IAB Discussion Paper 11/2006.

Khan, S. and Powell, J.L. (2001). Two-step estimation of semiparametric censored regression models. Journal of Econometrics 103, 73–110.

Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis with Missing Data. John Wiley, New York, 1st edn.

Meng, X.L. (1994). Multiple Imputation Inferences with Uncongenial Sources of Input. Statistical Science 9, 538–558.

Powell, J.L. (1986). Symmetrically Trimmed Least Squares Estimation for Tobit Models. Econometrica 54(6), 1435–1460.

Rässler, S. (2006). Der Einsatz von Missing Data Techniken in der Arbeitsmarktforschung des IAB. Allgemeines Statistisches Archiv.

Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. J. Wiley & Sons, New York.

Schafer, J.L. and Yucel, R.M. (2002). Computational Strategies for Multivariate Linear Mixed-Effects Models With Missing Values. Journal of Computational and Graphical Statistics 11, 437–457.

Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. Chapman & Hall, New York.

Combining Rules

• The multiple imputation point estimate for $\theta$ is defined as:

$$\hat\theta_{MI} = \frac{1}{m} \sum_{t=1}^{m} \hat\theta^{(t)}$$

• The associated variance estimate has two components. The within-imputation variance is the average of the complete-data variance estimates:

$$W = \frac{1}{m} \sum_{t=1}^{m} \widehat{Var}(\hat\theta^{(t)})$$

• The between-imputation variance is the variance of the complete-data point estimates:

$$B = \frac{1}{m-1} \sum_{t=1}^{m} (\hat\theta^{(t)} - \hat\theta_{MI})^2$$

• The total variance is defined as:

$$T = W + \frac{m+1}{m} B$$
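The combining rules translate directly into a few lines of Python. The sketch below is a minimal illustration for a single scalar parameter. The function name `combine_rubin` and the five estimates and variances in the usage lines are made up for the example.

```python
from statistics import mean, variance


def combine_rubin(estimates, variances):
    """Combine m complete-data estimates and their variance estimates."""
    m = len(estimates)
    theta_mi = mean(estimates)        # MI point estimate
    within = mean(variances)          # within-imputation variance W
    between = variance(estimates)     # between-imputation variance B, 1/(m-1)
    total = within + (m + 1) / m * between
    return theta_mi, within, between, total


# Hypothetical results from m = 5 imputed data sets.
estimates = [0.443, 0.439, 0.446, 0.441, 0.444]
variances = [0.00010, 0.00011, 0.00010, 0.00012, 0.00010]
theta_mi, w, b, t = combine_rubin(estimates, variances)
```

The total variance is never smaller than the within-imputation variance, because the between-imputation component reflects the extra uncertainty due to the imputed values.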


First Results

The simulation study using these three approaches shows the necessity of a new method that multiply imputes the missing wages and does not presume homoscedasticity

Second step: Development of a new multiple imputation approach considering heteroscedasticity

Finally, we perform a new simulation study to compare the four approaches under different situations in order to confirm the necessity as well as the validity of the new approach

Simulation Study (2)

The simulation procedure, consisting of
- drawing a random sample,
- deleting the wages above the limit,
- imputing the data using the different approaches,
- computing a regression, and
- calculating the confidence intervals,

is repeated 1,000 times.

Coverage: the fraction of confidence intervals containing the true parameter β, calculated for each approach

Summary of Results

In case of a homoscedastic structure of the residuals, the same quality of imputation results can be expected from the two multiple imputation approaches

In case of heteroscedasticity, the simulation study confirms the necessity of our new approach

Since the structure of the wages in the IAB employment register is heteroscedastic, the results of the simulation study necessitate the use of the new approach to impute the missing wage information in this register