1
Sampling and Inference
The Quality of Data and Measures
2
Why we talk about sampling
• General citizen education
• Understand data you’ll be using
• Understand how to draw a sample, if you need to
• Make statistical inferences
3
Why do we sample?
[Figure: cost/benefit plotted against N; series: Benefit (precision) and Cost (hassle factor)]
4
How do we sample?
• Simple random sample
  – Variant: systematic sample with a random start
• Stratified
• Cluster
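The first three designs in the list above can be sketched in a few lines of Python. This is an illustration only: the population of 1,000 people, the four "dept" labels, and the function names are all hypothetical, and cluster sampling is not shown.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n units without replacement, each unit equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Take every k-th unit after a random start in the first interval."""
    k = len(population) // n      # sampling interval
    start = rng.randrange(k)      # random start in [0, k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Draw a simple random sample of fixed size within each stratum."""
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {label: rng.sample(units, n_per_stratum)
            for label, units in strata.items()}

rng = random.Random(17)
# Hypothetical population: 1,000 people, each tagged with one of four departments.
people = [(i, "dept%d" % (i % 4)) for i in range(1000)]

srs = simple_random_sample(people, 20, rng)
sys_sample = systematic_sample(people, 20, rng)
strat = stratified_sample(people, lambda person: person[1], 5, rng)
print(len(srs), len(sys_sample), {k: len(v) for k, v in strat.items()})
```

Drawing the same number of units from every stratum guarantees that small groups appear in the sample, which is one way to read the "preserve or enhance variability" benefit of stratification.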
5
Stratification
• Divide sample into subsamples, based on known characteristics (race, sex, religiosity, continent, department)
• Benefit: preserve or enhance variability
6
Cluster sampling
Block → HH (household) unit → Individual
7
Effects of samples
• Obvious: influences marginals
• Less obvious
  – Allows effective use of time and effort
  – Effect on multivariate techniques
    • Sampling on independent variable: greater precision in regression estimates
    • Sampling on dependent variable: bias
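A small simulation can illustrate the contrast between selecting cases on x and selecting them on y. Everything here is assumed for illustration: a true slope of 2, standard normal x and errors, and a crude selection rule that keeps only cases above zero. The sketch shows only the bias point, not the precision gain from deliberately sampling a wide range of x.

```python
import random

def ols_slope(pairs):
    """Slope of the least-squares line of y on x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx

rng = random.Random(2)
TRUE_SLOPE = 2.0  # assumed for illustration
data = []
for _ in range(20000):
    x = rng.gauss(0, 1)
    y = TRUE_SLOPE * x + rng.gauss(0, 1)
    data.append((x, y))

slope_all = ols_slope(data)
slope_select_x = ols_slope([p for p in data if p[0] > 0])  # cases kept by x
slope_select_y = ols_slope([p for p in data if p[1] > 0])  # cases kept by y
print(round(slope_all, 2), round(slope_select_x, 2), round(slope_select_y, 2))
```

Under these assumptions the slope estimated from the x-selected cases stays close to 2, while the slope from the y-selected cases is pulled toward zero.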
8
Sampling on Independent Variable
[Figure: two scatterplots of y against x]
9
Sampling on Dependent Variable
[Figure: two scatterplots of y against x]
10
Sampling
Consequences for Statistical Inference
11
Statistical Inference: Learning About the Unknown From the Known
• Reasoning forward: distributions of sample means, when the population mean, s.d., and n are known.
• Reasoning backward: learning about the population mean when only the sample mean, s.d., and n are known
12
Reasoning Forward
13
First, we play with some simulations
• http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html
• http://www.kuleuven.ac.be/ucs/java/index.htm
14
Exponential Distribution Example
[Figure: histogram of inc; y-axis: Fraction; x-axis: 0 to 1,000,000]
Mean = 250,000
Median = 125,000
s.d. = 283,474
Min = 0
Max = 1,000,000
15
Consider 10 random samples, of n = 100 apiece
Sample   Mean
1        253,396.9
2        198,789.6
3        271,074.2
4        238,928.7
5        280,657.3
6        241,369.8
7        249,036.7
8        226,422.7
9        210,593.4
10       212,137.3
[Figure: histogram of inc; y-axis: Fraction; x-axis: 0 to 1,000,000]
16
Consider 10,000 samples of n = 100
N = 10,000
Mean = 249,993
s.d. = 28,559
Skewness = 0.060
Kurtosis = 2.92
[Figure: histogram of (mean) inc; y-axis: Fraction; x-axis: 0 to 1,000,000]
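The pattern on this slide can be reproduced with a short simulation. The sketch below assumes a pure exponential population with mean 250,000 (so its s.d. is also 250,000), which is not the same population as the slide's, and it uses 5,000 samples rather than 10,000 to keep the run short. The point is that the s.d. of the sample means lands near σ/√n. For the slide's own figures, 283,474/√100 ≈ 28,347, close to the reported 28,559.

```python
import random
import statistics

rng = random.Random(1)
POP_MEAN = 250_000    # assumed exponential population: s.d. equals the mean
N_PER_SAMPLE = 100
N_SAMPLES = 5_000     # fewer than the slide's 10,000, to keep this quick

sample_means = []
for _ in range(N_SAMPLES):
    draws = [rng.expovariate(1 / POP_MEAN) for _ in range(N_PER_SAMPLE)]
    sample_means.append(sum(draws) / N_PER_SAMPLE)

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
predicted_sd = POP_MEAN / N_PER_SAMPLE ** 0.5   # sigma / sqrt(n)
print(round(mean_of_means), round(sd_of_means), round(predicted_sd))
```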
17
Consider 1,000 samples of various sizes
Sample size   Mean      s.d.     Skew    Kurt
10            250,105   90,891   0.38    3.13
100           250,498   28,297   0.02    2.90
1000          249,938   9,376    -0.50   6.80
[Figure: three histograms of (mean) inc, one per sample size; y-axis: Fraction; x-axis: 0 to 1,000,000]
18
Difference of means example
[Figure: histogram of inc; y-axis: Fraction; x-axis: 0 to 1,000,000]
State 1: Mean = 250,000
[Figure: histogram of inc2; y-axis: Fraction; x-axis: 0 to 1,000,000]
State 2: Mean = 300,000
19
Take 1,000 samples of 10, of each state, and compare them
First 10 samples
Sample   State 1       State 2
1        311,410   <   365,224
2        184,571   <   243,062
3        468,574   >   438,336
4        253,374   <   557,909
5        220,934   >   189,674
6        270,400   <   284,309
7        127,115   <   210,970
8        253,885   <   333,208
9        152,678   <   314,882
10       222,725   >   152,312
20
1,000 samples of 10
[Figure: scatterplot of (mean) inc2 against (mean) inc; both axes run from 0 to 1.1e+06]
State 2 > State 1: 673 times
21
1,000 samples of 100
[Figure: scatterplot of (mean) inc2 against (mean) inc; both axes run from 0 to 1.1e+06]
State 2 > State 1: 909 times
22
1,000 samples of 1,000
[Figure: scatterplot of (mean) inc2 against (mean) inc; both axes run from 0 to 1.1e+06]
State 2 > State 1: 1,000 times
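The same experiment can be sketched in Python. The populations here are assumed to be pure exponentials with means 250,000 and 300,000, which only approximates the slides' data, and the function name is hypothetical. The exact counts will differ from 673, 909, and 1,000, but the share of samples in which State 2 comes out ahead should rise with n in the same way.

```python
import random

def share_state2_higher(n, reps, rng, mean1=250_000, mean2=300_000):
    """Fraction of paired samples where State 2's sample mean exceeds State 1's."""
    wins = 0
    for _ in range(reps):
        m1 = sum(rng.expovariate(1 / mean1) for _ in range(n)) / n
        m2 = sum(rng.expovariate(1 / mean2) for _ in range(n)) / n
        if m2 > m1:
            wins += 1
    return wins / reps

rng = random.Random(3)
shares = {n: share_state2_higher(n, 1000, rng) for n in (10, 100, 1000)}
print(shares)
```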
23
Another way of looking at it: The distribution of Inc2 – Inc1
n        Mean     s.d.
10       51,845   124,815
100      49,704   38,774
1,000    49,816   13,932
[Figure: three histograms of diff, one per sample size; y-axis: Fraction; x-axis: -400,000 to 600,000]
24
Reasoning Backward
When you know $\bar{X}$, $s$, and $n$, but want to say something about $\mu$
25
Central Limit Theorem
As the sample size n increases, the distribution of the mean $\bar{X}$ of a random sample taken from practically any population approaches a normal distribution, with mean $\mu$ and standard deviation $\sigma / \sqrt{n}$
26
Calculating Standard Errors
In general:
$$\text{std. err.} = \frac{s}{\sqrt{n}}$$
27
Most important standard errors
Mean: $s / \sqrt{n}$
Proportion: $\sqrt{p(1-p)/n}$
Diff. of 2 means: $s_p \sqrt{1/n_1 + 1/n_2}$
Regression (slope) coeff.: $\dfrac{\text{s.e.r.}}{\sqrt{n}} \times \dfrac{1}{s_x}$
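The four standard errors on this slide translate directly into small functions. The function names are hypothetical. The two printed checks use figures from the worked t-test examples later in these slides (s.d. 0.02323 with N = 1448, and s.e.r. 0.7115 with N = 1861 and s_x = 1.4788).

```python
import math

def se_mean(s, n):
    """Standard error of a sample mean: s / sqrt(n)."""
    return s / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

def se_diff_means(s_pooled, n1, n2):
    """Standard error of a difference of two means, using a pooled s.d."""
    return s_pooled * math.sqrt(1 / n1 + 1 / n2)

def se_slope(ser, n, s_x):
    """Standard error of a bivariate regression slope: (s.e.r./sqrt(n)) * (1/s_x)."""
    return ser / math.sqrt(n) / s_x

print(round(se_mean(0.02323, 1448), 5))           # 0.00061
print(round(se_slope(0.7115, 1861, 1.4788), 5))   # 0.01115
```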
28
Return to the applets for the regression standard error
• http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html
• http://www.kuleuven.ac.be/ucs/java/index.htm
29
The Idea Behind Classical Hypothesis Testing
True mean or regression coefficient
Sample mean or regression coefficient
H0 = 0
30
What We Know
• We know:
  – The sample mean/coeff. will not equal the population mean/coeff.
  – The sample mean/coeff., sample s.d./s.e., & n
• The question:
  – Is the sample mean/coeff. “far” from H0 or “close” to H0?
31
If n is sufficiently large, we know the distribution of sample means/coeffs. will obey the normal curve
[Figure: normal curve centered on the mean; x-axis marked from −4σ to 4σ; areas labeled 68%, 95%, 99%]
35
Therefore….
• When the sample size is large (i.e., > 150), convert the difference into z units and consult a z table
Z = (H1 - H0) / s.e.
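Instead of a printed z table, the standard normal distribution in Python's `statistics` module can do the lookup. The sketch below is illustrative: the function names and the inputs (an estimate of 0.5 with a standard error of 0.25) are made up, and the two-sided p-value is one common way to read "far from H0".

```python
from statistics import NormalDist

def z_statistic(estimate, h0, se):
    """Distance between the sample estimate and H0, in standard-error units."""
    return (estimate - h0) / se

def two_sided_p(z):
    """Probability of a value at least this far from 0 under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

z = z_statistic(estimate=0.5, h0=0.0, se=0.25)   # hypothetical numbers
p = two_sided_p(z)
critical_95 = NormalDist().inv_cdf(0.975)        # cutoff for a two-sided 5% test
print(z, round(p, 4), round(critical_95, 2))
```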
36
Reading a z table
z table for standardized normal distribution. Image removed for copyright reasons.
37
Therefore….
• When the sample size is small (i.e., <150), convert the difference into t units and consult a t table
t = (H1 - H0) / s.e.
38
t (when the sample is small)
[Figure: t-distribution and z (normal) distribution plotted together; x-axis from -4 to 4]
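The difference between the two curves can be checked numerically. This sketch writes out the textbook density of Student's t with `math.lgamma` and compares it with the normal density three units from the center. The degrees of freedom chosen (5, 30, 150) are illustrative, and the function names are hypothetical.

```python
import math

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

# Ratio of t density to normal density in the tail (x = 3).
ratios = {df: t_pdf(3.0, df) / normal_pdf(3.0) for df in (5, 30, 150)}
for df, ratio in ratios.items():
    print(df, round(ratio, 2))
```

The ratio is well above 1 for small degrees of freedom (heavier tails) and moves toward 1 as the degrees of freedom grow, which is why the t and z tables give nearly the same answer for large samples.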
39
Reading a t table
t table for the t distribution. Image removed for copyright reasons.
40
Doing a t-test
[Figure: histogram of diff9692; y-axis: Fraction; x-axis: -.2 to .4]
Q: How likely is it that the residual vote rate in 1996 was equal to the rate in 1992 (i.e., blank96 - blank92 = 0)?
Mean: 0.003069
s.d.: 0.02323
N: 1448
$$\text{s.e.} = \frac{s}{\sqrt{n}} = \frac{0.02323}{\sqrt{1448}} = 0.00061$$
41
The picture
Mean: 0.003069
s.d.: 0.02323
N: 1448
[Figure: normal curve over newz; x-axis ticks from -.00059 to .003069]
$$\text{s.e.} = \frac{s}{\sqrt{n}} = \frac{0.02323}{\sqrt{1448}} = 0.00061$$
$$t = \frac{0.003069 - 0}{0.00061} = 5.028$$
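The arithmetic on this slide can be checked in a few lines. The inputs are the mean, s.d., and N shown above. The p-value uses a normal approximation, which is an assumption of this sketch (reasonable with N = 1448) rather than something the slide computes.

```python
import math
from statistics import NormalDist

mean_diff = 0.003069   # mean of blank96 - blank92, from the slide
sd_diff = 0.02323      # s.d. of the difference, from the slide
n = 1448               # number of observations, from the slide

se = sd_diff / math.sqrt(n)          # standard error of the mean difference
t = (mean_diff - 0) / se             # distance from H0 = 0 in s.e. units
p_two_sided = 2 * (1 - NormalDist().cdf(abs(t)))   # normal approximation
print(round(se, 5), round(t, 2), p_two_sided < 0.0001)
```

Carrying the unrounded standard error through gives t of about 5.03, in line with the slide's 5.028.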
42
The STATA output

. ttest blank96=blank92

Paired t test

------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 blank96 |    1448    .0242941    .0005116    .0194689    .0232904    .0252977
 blank92 |    1448     .021225    .0005382    .0204813    .0201692    .0222808
---------+--------------------------------------------------------------------
    diff |    1448     .003069    .0006104    .0232279    .0018717    .0042664
------------------------------------------------------------------------------

     Ho: mean(blank96 - blank92) = mean(diff) = 0

 Ha: mean(diff) < 0        Ha: mean(diff) ~= 0        Ha: mean(diff) > 0
     t =   5.0278              t =   5.0278               t =   5.0278
 P < t =   1.0000        P > |t| =   0.0000           P > t =   0.0000

. ttest diff9692=0

One-sample t test

------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
diff9692 |    1448     .003069    .0006104    .0232279    .0018717    .0042664
------------------------------------------------------------------------------
Degrees of freedom: 1447

     Ho: mean(diff9692) = 0

 Ha: mean < 0              Ha: mean ~= 0              Ha: mean > 0
     t =   5.0278              t =   5.0278               t =   5.0278
 P < t =   1.0000        P > |t| =   0.0000           P > t =   0.0000
43
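Final t-test
Q: Was there a relationship between residual vote and county size in 1996?
Slope coeff: -0.07510
s.e.r.: 0.7115
N: 1861
s_x: 1.4788
$$\text{s.e.} = \frac{\text{s.e.r.}}{\sqrt{n}} \times \frac{1}{s_x} = \frac{0.7115}{\sqrt{1861}} \times \frac{1}{1.4788} = 0.01649 \times 0.6762 = 0.01115$$
[Figure: scatterplot of blank96 against vap96_to with fitted values; x-axis: 326 to 6.5e+06; y-axis: .000281 to .298789]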
44
Calculating t
$$t = \frac{-0.07510}{0.01115} = -6.7319$$
45
The STATA output
. reg lblank96 lvap96

      Source |       SS       df       MS              Number of obs =    1861
-------------+------------------------------           F(  1,  1859) =   45.32
       Model |   22.941515     1   22.941515           Prob > F      =  0.0000
    Residual |  941.080329  1859  .506229332           R-squared     =  0.0238
-------------+------------------------------           Adj R-squared =  0.0233
       Total |  964.021844  1860  .518291314           Root MSE      =   .7115

------------------------------------------------------------------------------
    lblank96 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      lvap96 |  -.0750985   .0111556    -6.73   0.000    -.0969774   -.0532197
       _cons |  -3.129858   .1113781   -28.10   0.000    -3.348298   -2.911419
------------------------------------------------------------------------------
46
A word about standard errors and collinearity
• The problem: if X1 and X2 are highly correlated, then it will be difficult to precisely estimate the effect of either one of these variables on Y
47
How does having another collinear independent variable affect standard errors?
$$\text{s.e.}(\hat{\beta}_1) = \sqrt{\frac{1}{N-n-1} \cdot \frac{S_Y^2}{S_{X_1}^2} \cdot \frac{1-R_Y^2}{1-R_{X_1}^2}}$$
$R_{X_1}^2$: R2 of the “auxiliary regression” of X1 on all the other independent variables
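A short calculation shows how the auxiliary R2 drives the standard error. This sketch assumes the formula places a square root over the whole product, and it reads N as the number of observations and n as the number of independent variables; neither is defined on the slide. The inputs (1,000 observations, unit standard deviations, R2 of 0.3 for Y) are hypothetical. The last line plugs in 0.46, the correlation between Conserv. and Repub. in the example correlation table.

```python
import math

def se_beta1(n_obs, n_regressors, s_y, s_x1, r2_y, r2_x1):
    """s.e. of the coefficient on X1; r2_x1 is the auxiliary-regression R2."""
    return math.sqrt((1 / (n_obs - n_regressors - 1))
                     * (s_y ** 2 / s_x1 ** 2)
                     * ((1 - r2_y) / (1 - r2_x1)))

def inflation(r2_x1):
    """Factor by which collinearity alone multiplies the standard error."""
    return 1 / math.sqrt(1 - r2_x1)

# Hypothetical inputs: only the auxiliary R2 changes between the two calls.
base = se_beta1(1000, 2, s_y=1.0, s_x1=1.0, r2_y=0.3, r2_x1=0.0)
collinear = se_beta1(1000, 2, s_y=1.0, s_x1=1.0, r2_y=0.3, r2_x1=0.75)
ratio = collinear / base
print(round(ratio, 2))                 # 2.0
print(round(inflation(0.46 ** 2), 3))  # 1.126
```

With an auxiliary R2 of 0.75 the standard error doubles, holding everything else fixed. A correlation of 0.46 between two regressors implies an inflation of roughly 13 percent from collinearity alone, before any change in the fit of the model.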
48
Example: Effect of party, ideology, and religiosity on feelings toward Quincy Bush
                 Bush Feelings   Conserv.   Repub.   Religious
Bush Feelings    1.0             .39        .57      .16
Conserv.                         1.0        .46      .18
Repub.                                      1.0      .06
Relig.                                               1.0
49
Regression table
             (1)            (2)           (3)            (4)
Intercept    32.7 (0.85)    32.9 (1.08)   32.6 (1.20)    29.3 (1.31)
Repub.       6.73 (0.244)   5.86 (0.27)   6.64 (0.241)   5.88 (0.27)
Conserv.     ---            2.11 (0.30)   ---            1.87 (0.30)
Relig.       ---            ---           7.92 (1.18)    5.78 (1.19)
N            1575           1575          1575           1575
R2           .32            .35           .35            .36