Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data. Lara L. Cleveland, IPUMS-International. November 14, 2010, Havana, Cuba.

Page 1: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Lara L. Cleveland, IPUMS-International

November 14, 2010, Havana, Cuba

Page 2: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Overview

  • Characteristics of Complex Samples
      • Public Use Census Data
      • IPUMS-International Census Samples
  • Adjusting for Sampling Error
  • Assessment Strategy
  • Results
  • Recommendations and Future Work

The IPUMS projects are funded by the National Science Foundation and the National Institutes of Health

Page 4: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Public Use Census Microdata

Publicly available census microdata often derive from complex samples.

HOWEVER, social science researchers commonly apply methods designed for simple random samples.

Page 5: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Public Use Census Data: Complex Samples

  • Clustering
      • By household (sample households rather than individuals)
      • Some samples geographically clustered
      • Can result in underestimated standard errors (when the clustering is ignored)
  • Differential weighting
      • Oversample select populations
      • Also leads to underestimated standard errors
  • Stratification
      • Explicitly by person or household characteristics
      • Implicitly by geographical area
      • Can result in overestimated standard errors (when the stratification is ignored)
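
To make the clustering point concrete, here is a minimal sketch using Kish's approximate design effect for equal-sized clusters, deff = 1 + (m - 1) * rho. The formula is a standard textbook approximation and is not taken from the slides; the household size of 5 and the intra-household correlation of 0.3 are hypothetical values chosen only for illustration.

```python
import math

def design_effect(cluster_size: float, icc: float) -> float:
    """Kish's approximate design effect for equal-sized clusters.

    cluster_size: persons per household (hypothetical input)
    icc: intra-cluster (intra-household) correlation of the variable
    """
    return 1.0 + (cluster_size - 1.0) * icc

def se_inflation(cluster_size: float, icc: float) -> float:
    """Factor by which a simple-random-sample SE would need to grow."""
    return math.sqrt(design_effect(cluster_size, icc))

# Hypothetical example: 5 persons per household, correlation 0.3.
deff = design_effect(5, 0.3)
factor = se_inflation(5, 0.3)
print(round(deff, 2), round(factor, 2))
```

Under these assumed values, an analysis that treats persons as a simple random sample would report standard errors that are too small by roughly the printed factor. With one person per household (cluster size 1) the factor is exactly 1.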

Page 6: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

IPUMS-I Data Processing

  • Data received varies in quality, detail and extent of documentation
  • 3 Sampling Processes
      • Country-produced public use sample
      • Sample drawn by partner country to IPUMS-I specifications
      • Full count data sampled by IPUMS-I

Page 7: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Samples Drawn by IPUMS-I

  • High density (typically 10% samples)
  • Household samples
      • Clustered by household
  • Systematic sample (every nth household)
      • Typically geographic sorting (presumed here)
      • Implicit geographic stratification
  • Uniformly weighted (self-weighting)
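
The sample design described on this slide can be sketched as follows. This is an illustration only, not IPUMS-I production code: the function name `systematic_sample`, the random start, and the toy frame of 3 geographic units with 100 households each are all assumptions.

```python
import random
from collections import Counter

def systematic_sample(households, interval=10, seed=None):
    """Take every `interval`-th household after a random start."""
    rng = random.Random(seed)
    start = rng.randrange(interval)  # random start in [0, interval)
    return households[start::interval]

# Hypothetical frame: 3 geographic units with 100 households each,
# listed in geographic order (the sort order the slide presumes).
frame = [(geo, hh) for geo in range(3) for hh in range(100)]
sample = systematic_sample(frame, interval=10, seed=1)

# Count sampled households per geographic unit.
per_unit = Counter(geo for geo, _ in sample)
print(len(sample), dict(per_unit))  # 30 {0: 10, 1: 10, 2: 10}
```

Because the list is sorted geographically, every unit contributes in proportion to its size, which is what makes the stratification implicit. Every household has the same 1-in-10 chance of selection, so the sample is self-weighting.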

Page 8: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Variance Estimation: Data Quality Assessment/Improvement

  • As researchers and data users
      • Assess accuracy of the data
      • Calculate precise estimates
  • As data custodians and disseminators
      • Distribute quality data samples
      • Create tools to facilitate research

Page 10: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Assessment Strategy

  • Create or specify variables to account for sampling error for use in current statistical packages
      • Cluster (Household identifier)
      • Strata (Pseudo-strata)
  • Compare estimates from full count data to estimates from sample data using 3 methods:
      • Subsample Replicate
      • Taylor Series Linearization
      • Simple Random Sample (SRS)
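
As a rough illustration of what a statistical package does with the cluster and strata variables, the sketch below computes a Taylor-series (linearized) standard error for a person-level mean. It assumes equal weights, the with-replacement "ultimate cluster" formula, and no finite population correction; packages may differ in these details. The function name and the record layout `(stratum, cluster, y)` are hypothetical.

```python
from collections import defaultdict
import math

def linearized_se_of_mean(rows):
    """Return (mean, SE) for equally weighted person records.

    rows: iterable of (stratum, cluster, y) tuples, where cluster is
    the household identifier and stratum is the pseudo-stratum.
    """
    rows = list(rows)
    n_persons = len(rows)
    mean = sum(y for _, _, y in rows) / n_persons

    # Linearized value per cluster: sum of (y - mean) / n over its persons.
    z = defaultdict(float)
    for stratum, cluster, y in rows:
        z[(stratum, cluster)] += (y - mean) / n_persons

    by_stratum = defaultdict(list)
    for (stratum, _), total in z.items():
        by_stratum[stratum].append(total)

    variance = 0.0
    for totals in by_stratum.values():
        n_h = len(totals)
        if n_h < 2:
            continue  # a single-cluster stratum adds nothing in this sketch
        zbar = sum(totals) / n_h
        variance += n_h / (n_h - 1) * sum((t - zbar) ** 2 for t in totals)
    return mean, math.sqrt(variance)

# One stratum, one person per household: reduces to the SRS formula s / sqrt(n).
srs_like = [(0, i, y) for i, y in enumerate([1, 2, 3, 4, 5])]
# Same values, but two identical persons per household.
clustered = [(0, i, y) for i, y in enumerate([1, 2, 3, 4, 5]) for _ in range(2)]

mean_a, se_a = linearized_se_of_mean(srs_like)
mean_b, se_b = linearized_se_of_mean(clustered)
```

In the second toy data set the duplicated household members add no new information, and the linearized standard error stays at the five-household value instead of shrinking as a naive person-level formula would suggest.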

Page 11: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Assessing Accuracy: Full Count Data

  • “True” or “Gold Standard” Estimates
      • Full count census data
      • Simulate sample design: 100 replicates, each a 10% sample
      • Estimate the mean and standard error of the mean for several household and person-level variables
  • Recent census data from 4 countries: Bolivia 2001, Ghana 2000, Mongolia 2000, Rwanda 2002
  • Full count, clean, well formatted data requiring no special corrections
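
The replicate idea can be sketched as below. The slides do not spell out how the replicate standard error is computed; this sketch assumes it is the standard deviation of the replicate means. The `simple_ten_percent` draw is a placeholder (a plain random 10% draw) and does not reproduce the systematic design, and the household sizes are synthetic.

```python
import random
import statistics

def replicate_se(values, draw_sample, n_replicates=100, seed=0):
    """Mean of replicate means, and their standard deviation as the SE."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_replicates):
        sample = draw_sample(values, rng)
        means.append(statistics.fmean(sample))
    return statistics.fmean(means), statistics.stdev(means)

def simple_ten_percent(values, rng):
    """Placeholder design: a 10% draw without replacement."""
    return rng.sample(values, k=len(values) // 10)

# Synthetic "full count": household sizes for 5000 households.
rng0 = random.Random(42)
full_count = [rng0.randint(1, 9) for _ in range(5000)]

mean_hat, se_hat = replicate_se(full_count, simple_ten_percent)
print(round(mean_hat, 2), round(se_hat, 3))
```

Passing a different `draw_sample` function, for example one that mimics the systematic household design, would change the design being simulated without touching the rest of the code.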

Page 12: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Assessing Accuracy: Sample Data

  • Sub-sample Replicate
      • Mimic sample design: 100 10% subsamples
      • Labor and resource heavy
  • Taylor Series Linearization
      • Clustering: household identifier
      • Stratification: pseudo-strata variable
          • 10 adjacent households within geographic unit
          • Incomplete strata pooled with preceding strata
      • Available in most statistical packages
  • Simple Random Sample as control/comparison
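
The pseudo-strata rule on this slide (10 adjacent households within a geographic unit, incomplete strata pooled with the preceding stratum) might be implemented along these lines. The function name and the `(geo_unit, household_id)` record layout are hypothetical, and the handling of a unit with fewer than 10 households, which here forms a stratum of its own, is an assumption the slides do not address.

```python
from collections import Counter
from itertools import groupby

def assign_pseudo_strata(records, size=10):
    """Map household_id -> pseudo-stratum number.

    records: list of (geo_unit, household_id) in file order, assumed
    to be sorted so that each geographic unit is contiguous.
    """
    strata = {}
    next_id = 0
    for _, group in groupby(records, key=lambda r: r[0]):
        hh_ids = [hh for _, hh in group]
        n_full = len(hh_ids) // size
        for i, hh in enumerate(hh_ids):
            # Leftover households join the last full stratum of the unit.
            block = min(i // size, max(n_full - 1, 0))
            strata[hh] = next_id + block
        next_id += max(n_full, 1)
    return strata

# Hypothetical units: "A" with 25 households, "B" with 7.
records = [("A", i) for i in range(25)] + [("B", 100 + i) for i in range(7)]
strata = assign_pseudo_strata(records)
sizes = Counter(strata.values())
print(dict(sizes))  # {0: 10, 1: 15, 2: 7}
```

Pooling the 5 leftover households of unit "A" into the preceding stratum keeps every stratum within a unit at 10 or more households, so each has several clusters available for variance estimation.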

Page 14: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Table Format

From Full Count Data (“Gold Standard”)
  1) Full Count Mean
  2) S.E. of mean from Full Count Replicate

From Sample Data: Ratios of Standard Errors
  3) SE(Sub-sample Replicate) / SE(Full Count Replicate)
  4) SE(Sample Taylor Series) / SE(Full Count Replicate)
  5) SE(SRS) / SE(Full Count Replicate)

Ratios
  ~1.0: Sample estimate resembles “true” value
  >1.0: Sample estimate overestimates SE
  <1.0: Sample estimate underestimates SE
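
The reading rule for columns (3) to (5) can be written as a small helper. The tolerance of 0.1 around 1.0 is an assumption made for this sketch; the slides only say "~1.0". The function name is hypothetical.

```python
def interpret_ratio(se_sample, se_full_count, tolerance=0.1):
    """Classify SE(sample method) / SE(full count replicate)."""
    ratio = se_sample / se_full_count
    if abs(ratio - 1.0) <= tolerance:
        label = "resembles true value"
    elif ratio > 1.0:
        label = "overestimates SE"
    else:
        label = "underestimates SE"
    return ratio, label

# Illustrative inputs only: a sample SE half the size of the gold standard,
# and one nearly twice as large.
print(interpret_ratio(0.050, 0.100))
print(interpret_ratio(0.180, 0.100))
```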

Page 15: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Columns: (1) Full Count Parameter Estimate; (2) SE from Full Count Replicate; (3) to (5) Ratio of SE Estimates, 10% Sample to Full Count Replicate: (3) Subsample Replicate, (4) Taylor Series with Pseudo-Strata and HH Cluster, (5) Simple Random Sample.

Selected Characteristics      (1)     (2)    (3)   (4)   (5)
Household
  HH Size (mean)             4.71   0.005   0.8   0.9   0.9
  Electric Light (%)         4.18   0.034   0.9   0.9   1.3
  Toilet (%)                 0.38   0.013   0.9   0.9   1.0
  Radio (%)                 43.11   0.103   0.9   1.0   1.0
  Earth Floor (%)           85.28   0.073   0.8   0.9   1.0
  Home Ownership (%)        86.41   0.056   1.1   1.1   1.3
  Non-relatives (mean)       0.30   0.002   1.1   1.0   1.1
Person
  Age (mean)                20.77   0.015   0.9   1.0   1.1
  Sex (%)                   46.81   0.045   0.9   1.0   1.1
  Religion: Catholic (%)    46.69   0.100   1.0   1.0   0.5
  Religion: Protestant (%)  26.16   0.077   1.1   1.1   0.6
  Married (%)               17.64   0.039   0.9   1.0   1.0
  Literate (%)              39.75   0.060   0.9   0.9   0.8
  Employed (%)              40.94   0.048   0.9   0.9   1.0

Table 1. Rwanda 2002: Comparing Complete Count and Sample Standard Error Estimates

Page 16: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Columns: (1) Full Count Parameter Estimate; (2) SE from Full Count Replicate; (3) to (5) Ratio of SE Estimates, 10% Sample to Full Count Replicate: (3) Subsample Replicate, (4) Taylor Series with Pseudo-Strata and HH Cluster, (5) Simple Random Sample.

Selected Characteristics      (1)     (2)    (3)   (4)   (5)
Household
  HH Size (mean)             4.45   0.008   0.9   0.9   1.0
  Electric Light (%)        67.53   0.098   1.1   1.0   1.8
  Toilet (%)                62.46   0.135   1.1   1.2   1.4
  Kitchen (%)               39.08   0.145   1.0   1.0   1.3
  Bathroom (%)              21.74   0.096   1.0   1.1   1.5
  Phone (%)                 17.01   0.136   1.0   1.0   1.1
  Non-relatives (mean)       0.11   0.002   0.9   1.0   1.0
Person
  Age (mean)                24.57   0.034   1.0   1.0   1.0
  Sex (%)                   49.47   0.078   0.9   1.0   1.2
  Ethnicity: Khalkh (%)     81.59   0.111   0.9   1.0   0.6
  Ethnicity: Kazak (%)       4.28   0.047   1.0   1.1   0.8
  Married (%)               32.33   0.081   0.9   1.0   1.1
  Literate (%)              81.56   0.071   1.1   1.0   1.0
  Employed (%)              32.47   0.095   0.9   0.9   0.9

Table 2. Mongolia 2000: Comparing Complete Count and Sample Standard Error Estimates

Page 17: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Columns: (1) Full Count Parameter Estimate; (2) SE from Full Count Replicate; (3) to (5) Ratio of SE Estimates, 10% Sample to Full Count Replicate: (3) Subsample Replicate, (4) Taylor Series with Pseudo-Strata and Cluster, (5) Simple Random Sample.

Selected Characteristics      (1)      (2)    (3)   (4)   (5)
Household
  HH Size (mean)             3.93   0.0046   1.0   1.0   1.1
  Electric Light (%)        60.51   0.0536   1.1   1.2   1.9
  Toilet (%)                59.48   0.0649   1.0   1.1   1.6
  Kitchen (%)               70.62   0.0882   0.9   1.0   1.1
  Phone (%)                 21.33   0.0605   1.3   1.1   1.4
  Radio (%)                 71.17   0.0819   0.9   1.0   1.1
  Earth Floor (%)           35.66   0.0519   1.2   1.3   1.9
  Non-relatives (mean)       0.19   0.0012   1.0   1.0   1.1
Person
  Age (mean)                24.70   0.0004   1.0   1.1   1.0
  Sex (%)                   49.84   0.0024   0.9   0.9   1.1
  Ethnicity: Quechua (%)    30.69   0.0053   1.0   1.0   0.8
  Ethnicity: Aymara (%)     25.19   0.0047   0.8   0.9   0.8
  Married (%)               26.09   0.0023   0.9   1.0   1.0
  Literate (%)              74.99   0.0025   0.9   0.9   0.9
  Employed (%)              34.37   0.0022   1.1   1.1   1.0

Table 3. Bolivia 2001: Comparing Complete Count and Sample Standard Error Estimates

Page 18: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Columns: (1) Mean; (2) SE (Full Count Replicate); Taylor Series Linearization: (3) Adjusted for Clustering and Implicit Stratification, (4) Effect of Clustering (Adjusted for Strata Only), (5) Effect of Stratification (Adjusted for Cluster Only); SRS: (6) Combined Effect of Clustering and Stratification.

Person                       (1)      (2)    (3)   (4)   (5)   (6)
  Age (mean)                24.7   0.0004   1.1   1.0   1.1   1.0
  Sex (%)                   49.8   0.0024   0.9   1.1   0.9   1.1
  Ethnicity: Quechua (%)    30.7   0.0053   1.0   0.6   1.4   0.8
  Ethnicity: Aymara (%)     25.2   0.0047   0.9   0.5   1.4   0.8
  Married (%)               26.1   0.0023   1.0   1.0   1.0   1.0
  Literate (%)              75.0   0.0025   0.9   0.9   1.0   0.9
  Worked (%)                34.4   0.0022   1.1   1.0   1.2   1.0

Table 4. Bolivia 2001: Decomposition of Clustering and Stratification Effects on Taylor Series Standard Error Estimates from the 10% Sample

Page 19: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Columns: (1) Full Count Parameter Estimate; (2) SE from Full Count Replicate; (3) to (5) Ratio of SE Estimates, 10% Sample to Full Count Replicate: (3) Subsample Replicate, (4) Taylor Series with Pseudo-Strata and HH Cluster, (5) Simple Random Sample.

Selected Characteristics        (1)     (2)    (3)   (4)   (5)
Household
  HH Size (mean)               4.99   0.005   1.1   1.0   1.0
  Electric Light (%)          43.54   0.042   1.5   1.5   1.8
  Toilet (%)                   8.49   0.026   1.2   1.5   1.7
  Kitchen (%)                 46.17   0.062   1.2   1.2   1.2
  Bathroom (%)                23.47   0.046   1.5   1.4   1.4
  Non-relatives (mean)         0.14   0.001   0.9   1.0   1.0
Person
  Age (mean)                  23.90   0.013   1.0   1.1   1.0
  Sex (%)                     49.48   0.035   1.0   1.0   1.0
  Ethnicity: Akan (%)         45.28   0.066   0.9   1.0   0.5
  Ethnicity: Mole-dagbani (%) 15.25   0.051   1.0   1.0   0.5
  Married (%)                 29.28   0.029   1.2   1.2   1.1
  Literate (%)                34.00   0.038   1.0   1.1   0.9
  Employed (%)                42.44   0.038   1.3   1.1   0.9

Table 5. Ghana 2000: Comparing Complete Count and Sample Standard Error Estimates

Page 21: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Recommendations: Clustering

  • Many research projects need not worry
      • Subpopulations that rely on only one person per HH (e.g., fertility, aging, some work-related studies)
  • Design research to select a single person from the household
  • Use household identifier in stat packages
  • Future: Variable that includes identifier for geographic clustering as needed
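
One way to follow the "select a single person from the household" recommendation is sketched below. The field names `hh_id` and `person_id` are hypothetical and do not refer to actual IPUMS-International variable names. Random selection within the household is also an assumption; a study might instead pick a specific member, such as the household head.

```python
import random
from collections import defaultdict

def one_person_per_household(persons, seed=0):
    """Pick one person at random from each household."""
    rng = random.Random(seed)
    by_household = defaultdict(list)
    for person in persons:
        by_household[person["hh_id"]].append(person)
    return [rng.choice(members) for members in by_household.values()]

# Hypothetical person records: three households of different sizes.
persons = [
    {"hh_id": 1, "person_id": 1},
    {"hh_id": 1, "person_id": 2},
    {"hh_id": 2, "person_id": 1},
    {"hh_id": 3, "person_id": 1},
    {"hh_id": 3, "person_id": 2},
    {"hh_id": 3, "person_id": 3},
]
selected = one_person_per_household(persons)
print([(p["hh_id"], p["person_id"]) for p in selected])
```

With one person per household there is no within-household clustering left to adjust for. Note that random selection gives members of larger households a smaller chance of being picked, which an analyst may want to reflect in the weights.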

Page 22: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Recommendations: Stratification

  • Most researchers need no modification
      • Stratification increases precision
      • Estimates are conservative
  • If concerned, use pseudo-strata
      • Investigations of weak relationships
      • For some sub-population studies
  • Future: Pseudo-strata variable to specify information about implicit stratification

Page 23: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Recommendations: Web Guidance

Page 24: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Recommendations: Web Guidance

Page 25: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Current and future work

  • Determine optimal pseudo-strata size
  • Investigate Ghana data distribution
  • Seek more geographic detail in the data
  • Compare estimates to published population counts
  • Additional data quality tests

Page 26: Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data

Thank you!

Questions?

Lara L. Cleveland
IPUMS International
Minnesota Population Center
University of Minnesota
50 Willey Hall
225 – 19th Avenue South
Minneapolis, MN 55455
