39
INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

Embed Size (px)

DESCRIPTION

Coverage What is the target (theoretical) population? Does the statistical frame fully cover this population? What are the known deficiencies in the frame? Can these be addressed with other data? 3/16/2011 (c) John M. Abowd and Lars Vilhuber 2011, all rights reserved 3

Citation preview

Page 1: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

INFO 4470/ILRLE 4470 Visualization Tools and Data

Quality

John M. Abowd and Lars VilhuberMarch 16, 2011

Page 2: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

2

Outline

• Data quality issues• Coverage• Known biases• Missing data• Extended example using the Quarterly

Workforce Indicators

3/16/2011

Page 3: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

3

Coverage

• What is the target (theoretical) population?• Does the statistical frame fully cover this

population?• What are the known deficiencies in the

frame?• Can these be addressed with other data?

3/16/2011

Page 4: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

4

Biases and Sampling Error

• Bias: the statistical model does not estimate the quantity of interest in expected value

• Sampling error: the statistical model has imprecision in estimating the quantity of interest

• Non-sampling error: variation in the statistical model introduced by editing, imputation, non-random non-response, etc.

3/16/2011

Page 5: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

5

What are Missing Data?

• Random or systematic missing items or entities in a statistical sample or frame

• One of the most ubiquitous data quality problems known

• Cannot be ignored, or rather

3/16/2011

Page 6: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

6

How to code missing data?

• How about giving a zero, “0” for missing?• Note with a distinctive value such as:– -9 for non-responses– -8 for invalid responses– Or even “.”

• Statistical software like SAS, Stata and SPSS can then recognize those values as missing and treat them according to your instructions

3/16/2011

Page 7: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

7

How to Handle Missing Data

• It depends on the type of analysis:– Simple summary tabulations– Descriptive stats, including mean and standard

deviation– Correlation, relation between two or more

variables– Model to estimate the effects of characteristics

3/16/2011

Page 8: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

Missing vs Non-missing

"There is no difference between cases with missing versus observed

data"

Page 9: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

9

“Inference and Missing Data” Rubin (1976)

• MCAR – Missing Completely At Random– Rigorous and hard to achieve– Sometimes by design

• MAR – Missing At Random– Less Rigorous– Cannot be tested without the data which are

missing

3/16/2011

Page 10: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

10

Assume MAR?

• YES – then mechanism by which data are missing is IGNORABLE

• NO – then mechanism is not ignorable, requires more involved procedures

3/16/2011

Page 11: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

11

MAR - YES

Delete• Listwise

• Pairwise

Impute• Assign Mean

• Hot Deck

• Maximum Likelihood

• Multiple Imputation

3/16/2011

Page 12: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

Visualization of Missing Data in the Quarterly Workforce Indicators

Page 13: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

13

Data Sources (Quarterly Only)

• Quarterly Workforce Indicators (Census Bureau)

• Quarterly Census of Employment and Wages (BLS)

• Business Employment Dynamics (BLS)• Job Openings and Labor Turnover Survey• Adjustments by to JOLTS by Davis, Faberman,

Haltiwanger, and Rucker (2010)

3/16/2011

Page 14: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

Quarterly Workforce Indicators I

• Flows are based on longitudinally linked (by employer and employee) Unemployment Insurance Wage Records

• Beginning-of-quarter employed if wage record with earnings > $1.00 in quarters t-1 and t (B)

• End-of-quarter employed if wage record with earnings > $1.00 in quarters t and t+1 (E)

• Accession if wage record in t but not t-1 (A)• Separation if wage record in t but not t+1 (S)

Page 15: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

15

Quarterly Workforce Indicators II

• Job creation if establishment has positive employment change from beginning to end of quarter (JC)

• Job destruction if establishment has negative employment change from beginning to end of quarter (JD), always stated as absolute value of change

• Demographic (age x gender), geographic (county, CBSA, WIB), NAICS (sector, sub-sector, industry group), and ownership (All, private)3/16/2011

Page 16: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

16

Quarterly Census of Employment and Wages

• Stocks of employment measured as of the 12th day of the month for each month in the quarter for each establishment (no job level data)

• BLS uses month-3 employment to measure changes• Beginning of quarter employment is month-3

employment from quarter t-1• End of quarter employment is month-3 employment

for quarter t• Geographic (county, MSA), NAICS (sector, sub-sector,

industry group), and ownership (all, public, private)

3/16/2011

Page 17: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

17

Business Employment Dynamics

• Gross job gains (job creations) is the change in employment at an establishment between month-3 in quarter t-1 and month-3 in quarter t, if positive

• Gross job losses (job destructions) is the absolute value of the change in employment at an establishment between month-3 in quarter t-1 and month-3 in quarter t, if negative

• Limited geography (state), NAICS (sector) detail3/16/2011

Page 18: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

18

Job Openings and Labor Turnover Survey

• Monthly survey of continuing establishments• Accessions measured as all new employment over

the course of the month (summed over the three months in the quarter here)

• Separations measured as quits, layoffs, discharges, and other over the course of the month (summed over the three months of the quarter here)

• Limited geography (national) and NAICS (sector) detail

3/16/2011

Page 19: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

19

QWI Coverage of the Private Workforce

Sub-period 1

Sub-period 2

3/16/2011

Page 20: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

20

Worker Reallocation Rate

• This rate is available in the QWI for 8 age groups, both genders, NAICS sector, state and quarter from 1990:Q1 to 2008:Q4 (more detail is available than we used)

2agsktagskt

agsktagsktagskt EB

SAWRR

3/16/2011

Page 21: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

21

Job Reallocation Rate

• This rate is available in the QWI for 8 age groups, both genders, NAICS sector, state and quarter from 1990:Q1 to 2008:Q4 (more detail is available than we used)

2agsktagskt

agsktagsktagskt EB

JDJCJRR

3/16/2011

Page 22: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

22

Excess Reallocation Rate

• ERR = WRR – JRR• The excess reallocation rate measures the extent

to which gross worker flows exceed the minimum required to service the gross job flows

• This has been very difficult to estimate nationally because there were no data collected on a consistent basis for all the component flows

• QWI solved that problem

3/16/2011

Page 23: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

23

Statistical Methodology

• Divide the analysis into two periods– 1993:Q1-2001:Q4 (early period, many states are completely

missing, 10 states complete)– 1999:Q1-2008:Q4 (later period, 37 states are complete)

• For each sub-period use a multiple imputation model to complete the missing data

• For the overlap period, use a ramped weight to compute the average implicate combining the two periods

• Use the standard multiple imputation formulae to combine implicates

3/16/2011

Page 24: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

24

National Estimates

• The combining formula for producing the national WRR is shown above (similar formulae apply to other rates)

WRRag+kt = 1MP M

`=1

2664P

8s

µB ( `)agsk t +E ( `)

agsk t2

¶WRR ( `)

agsk t

P8v

µB ( `)agvk t +E ( `)

agvk t2

3775

agkstsk

agkstagkst

agtagt

agtagt

agtagtagt

WRREB

EB

EBSA

WRR

, 22

2

3/16/2011

Page 25: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

25

Implicate Combining Formulae I

V (`)hdWRRag+kt

i= 1

49P

8s

µB (`)agsk t +E (`)

agsk t2

¶³WRR (`)

agsk t ¡ dWRR ( `)ag+k t

´ 2

P8v

µB (`)agvk t +E ( `)

agvk t2

¶ :

B £WRRag+kt¤= 1

M ¡ 1P M

`=1

µdWRR (`)

ag+kt ¡ WRRag+kt

¶2

M

s ktagktag

agsktagsktagskt

agktEB

WRREB

MWRR

1

2

21

s ktagktag

agsktagsktagskt

agkt EB

WRREB

RRW

2

2

3/16/2011

Page 26: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

26

Implicate Combining Formulae II

T £WRRag+kt¤= 1

MP M

`=1V (`)hdWRRag+kt

i+ M +1

M B £WRRag+kt¤

MR £WRRag+kt¤= B [WRR ag+k t ]

T [WRR ag+k t ]

df £WRRag+kt¤= (M ¡ 1)

µ1+ 1

M +11MP M

`=1 V( `)£dWRR ag+k t

¤B [WRR ag+k t ]

¶2:

s ktagktag

agktagsktagsktagskt

agkt EB

RRWWRREB

RRWV

2

2491

2

M

agktagktagkt WRRRRWM

WRRB1

2

11

agkt

M

agktagkt WRRBMMRRWV

MWRRT

1

11

3/16/2011

Page 27: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

27

Implicate Combining Formulae III

T £WRRag+kt¤= 1

MP M

`=1V (`)hdWRRag+kt

i+ M +1

M B £WRRag+kt¤

MR £WRRag+kt¤= B [WRR ag+k t ]

T [WRR ag+k t ]

df £WRRag+kt¤= (M ¡ 1)

µ1+ 1

M +11MP M

`=1 V( `)£dWRR ag+k t

¤B [WRR ag+k t ]

¶2:

agkt

agktagkt

WRRTWRRBWRRMR

3/16/2011

Page 28: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

28

Results

• Comparison of WRR between QWI and JOLTS• Comparison of JRR between QWI and BED• Demographic detail• Industry detail

3/16/2011

Page 29: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

29

WRR: QWI v. JOLTS

3/16/2011

Page 30: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

30

WRR: QWI v. Adjusted JOLTS

3/16/2011

Page 31: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

31

JRR: QWI v. BED

3/16/2011

Page 32: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

32

JRR WRR ERR by Gender

3/16/2011

Page 33: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

33

ERR (Churning) by Age Group

3/16/2011

Page 34: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

34

Job Creation Rate

3/16/2011

Page 35: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

35

Job Destruction Rate

3/16/2011

Page 36: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

36

Job Reallocation Rate

3/16/2011

Page 37: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

37

Accession Rate

3/16/2011

Page 38: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

38

Separation Rate

3/16/2011

Page 39: INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011

(c) John M. Abowd and Lars Vilhuber 2011, all rights reserved

39

Worker Reallocation Rate

3/16/2011