5. Estimation
2014/4/17
5.1 Introduction
• Statistical inference
– Drawing conclusions about a population based on the results obtained from a sample of that population
1) Estimation
2) Hypothesis testing
• Estimate
1) Point estimate
2) Interval estimate
• Estimator
• Unbiasedness
• Target population vs. sampling population
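$\bar{x} = \frac{\sum x_i}{n}$ : estimator of $\mu$
$\hat{\theta}$ (a r.v. based on data) is an unbiased estimator of $\theta$ (parameter) if $E(\hat{\theta}) = \theta$
ex. $E(\bar{X}) = \mu$, so the sample mean is an unbiased estimator of the population mean if the samples are randomly selected from $N(\mu, \sigma^2)$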
Ex) The sample variance $s^2$ is an unbiased estimator of $\sigma^2$: $E(s^2) = \sigma^2$
So $\frac{1}{n}\sum (y_i - \bar{y})^2$ is not an unbiased estimator of $\sigma^2$
• Bias = $E(\hat{\theta}) - \theta$
• Bias of an unbiased estimator is zero
• Probability sampling and non-probability sampling
• Randomization
• Blinding
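The unbiasedness claim for $s^2$ can be checked by simulation. The following is a minimal sketch; the population N(10, 2²), the sample size n = 5, and the number of replicates are illustrative assumptions, not values from the slides.

```r
# Simulation sketch: compare the (n - 1) and n divisors for the sample variance.
# Assumed values (illustrative only): N(10, 2^2), n = 5, 20000 replicates.
set.seed(1)
n <- 5
sigma2 <- 4
reps <- 20000
samples <- matrix(rnorm(n * reps, mean = 10, sd = sqrt(sigma2)), nrow = reps)

s2_unbiased <- apply(samples, 1, var)      # var() uses the divisor n - 1
s2_biased   <- s2_unbiased * (n - 1) / n   # same sums of squares, divisor n

mean_unbiased <- mean(s2_unbiased)   # should be close to sigma2
mean_biased   <- mean(s2_biased)     # should be close to (n - 1) / n * sigma2
print(c(mean_unbiased, mean_biased))
```

The average of the divisor-n version settles near (n − 1)/n · σ², which is the bias the slide refers to.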
5.2 Confidence interval of the population mean
• $\bar{x} \pm z_{(1-\alpha/2)}\,\sigma_{\bar{x}}$ : $100(1-\alpha)\%$ confidence interval of $\mu$
• If we select samples repeatedly from a normal population, $\bar{x} \pm z_{(1-\alpha/2)}\,\sigma_{\bar{x}}$ will include $\mu$ with probability $1-\alpha$
• $1-\alpha$ : confidence level (ex. .95)
$\alpha$ : significance level (ex. .05)
Confidence interval = estimate ± (reliability coefficient) × (standard error)
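The repeated-sampling interpretation of the confidence level can be illustrated with a short simulation. This is a sketch under assumed values (μ = 50, σ = 10, n = 25 are hypothetical and chosen only for illustration).

```r
# Coverage sketch: fraction of z-intervals that contain the true mean.
# Assumed values (illustrative only): mu = 50, sigma = 10, n = 25, alpha = 0.05.
set.seed(2)
mu <- 50
sigma <- 10
n <- 25
alpha <- 0.05
reps <- 10000

xbar <- replicate(reps, mean(rnorm(n, mu, sigma)))
half_width <- qnorm(1 - alpha / 2) * sigma / sqrt(n)   # reliability coef x SE
covered <- (xbar - half_width <= mu) & (mu <= xbar + half_width)
coverage <- mean(covered)   # expected to be near 1 - alpha
print(coverage)
```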
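<Example 5.2.1>
A researcher measures the amount of a certain enzyme. n = 10, sample mean = 22. We can assume normality with population variance = 45. 95% C.I. of $\mu$?
$\sigma_{\bar{x}} = \sqrt{45/10} = 2.12$
$\bar{x} \pm 2\sigma_{\bar{x}} = 22 \pm 2(2.12) = (17.76, 26.24)$, using $z \approx 2$ for 95%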
<Example 5.2.2>
Measuring the maximum strength of a certain muscle. We want a 99% CI of the population mean. We assume normality with population variance = 144. n = 15, sample mean = 84.3.
z = 2.58 at the 0.99 confidence level
SE: $\sigma_{\bar{x}} = 12/\sqrt{15} = 3.10$
What is the 99% C.I. of $\mu$?
$84.3 \pm 2.58(3.10) = 84.3 \pm 8.0 = (76.3, 92.3)$
• Sample from a non-normal population → central limit theorem
<Example 5.2.3>
Delay time caused by patients arriving late at a clinic. n = 35, sample mean = 17.2 min, sd from a previous study (assumed to be known) = 8 min. The population is not normally distributed.
What is the 90% CI of $\mu$?
$\sigma_{\bar{x}} = 8/\sqrt{35} = 1.35$
$17.2 \pm 1.645(1.35) = 17.2 \pm 2.2 = (15.0, 19.4)$
<Example 5.2.4>
Measure the activity of a certain enzyme in 35 patients.
Population variance = 0.36, 95% CI?
-> apply CLT
$0.7164 \pm 1.96(0.6/\sqrt{35}) = (0.5176, 0.9152)$
CI calculation using R
> m <- 0.7164
> s <- sqrt(0.36)
> n <- 35
> alpha <- 0.05
> error <- qnorm(1-alpha/2)*s/sqrt(n)
> left <- m-error
> right <- m+error
> left
[1] 0.5176234
> right
[1] 0.9151766
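# The same calculation wrapped in a function.
# Note: defining confint() here masks stats::confint for the session.
confint <- function(m, s, n, alpha = 0.05) {
  error <- qnorm(1 - alpha/2) * s / sqrt(n)   # reliability coef x SE
  left <- m - error
  right <- m + error
  print(c(left, right))
}
confint(0.7164, sqrt(0.36), 35)
[1] 0.5176234 0.9151766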
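5.3 t-distribution
• Population variance is known and n is large: $z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}}$
• Population variance is not known and n is large: $z = \frac{\bar{x}-\mu}{s/\sqrt{n}}$, with $s = \sqrt{\frac{\sum (x_i-\bar{x})^2}{n-1}}$
• Small sample size (n < 30): $t = \frac{\bar{x}-\mu}{s/\sqrt{n}} \sim t_{n-1}$
derived by Gosset: "Student's t-distribution" (Table E)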
x <- seq(-4, 4, length=100)
hx <- dnorm(x)
degf <- c(1, 3, 8, 30)
colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")
plot(x, hx, type="l", lty=2, xlab="x value", ylab="Density", main="Comparison of t Distributions")
for (i in 1:4){
  lines(x, dt(x,degf[i]), lwd=2, col=colors[i])
}
legend("topright", inset=.05, title="Distributions", labels, lwd=2, lty=c(1, 1, 1, 1, 2), col=colors)
• Some properties of the t-distribution
1) Mean = 0
2) Symmetric about the mean
3) Variance > 1, → 1 as n → ∞
4) $-\infty < t < \infty$
5) The shape depends on the degrees of freedom (df) = n − 1
6) Flatter at the center and heavier in the tails than the normal distribution
7) t-distribution → normal distribution as n − 1 → ∞
• CI : $\bar{x} \pm t_{(1-\alpha/2)}\frac{s}{\sqrt{n}}$
<Example 5.3.1>
n = 15, measure amylase, sample mean = 96 units/100 ml, sd = 35, population variance is not known. 95% CI of the population mean?
SE($\bar{x}$) = $s/\sqrt{n} = 35/\sqrt{15} = 9.04$
df = $n - 1 = 14$
$t_{0.975} = 2.1448$
$96 \pm 2.1448(9.04) = 96 \pm 19 = (77, 115)$
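The t-based interval can be computed in R from summary statistics with `qt()`. The helper name `t_ci` below is hypothetical, introduced only for this sketch; the inputs are the values of Example 5.3.1.

```r
# t-based CI for a mean from summary statistics.
# t_ci is a hypothetical helper name, not a function from any package.
t_ci <- function(xbar, s, n, alpha = 0.05) {
  se <- s / sqrt(n)                         # estimated standard error
  tcrit <- qt(1 - alpha / 2, df = n - 1)    # reliability coefficient
  c(lower = xbar - tcrit * se, upper = xbar + tcrit * se)
}

ci <- t_ci(xbar = 96, s = 35, n = 15)   # Example 5.3.1
print(round(ci))                        # lower 77, upper 115
```

When the raw observations are available, `t.test(x)$conf.int` returns the same kind of interval directly.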
• Choice of z and t
[Flowchart: population ~ normal? → large enough n? → σ² known? For a non-normal population with large n, apply the CLT; for a non-normal population with small n, use non-parametric methods.]
5.4 CI of the difference of the two means
• Samples from normal populations: $(\bar{x}_1-\bar{x}_2) \pm z_{(1-\alpha/2)}\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}$
<Example 5.4.1>
Measure serum uric acid in 12 patients ($\bar{x}_1 = 4.5$ mg/100 ml) and in 15 normal controls ($\bar{x}_2 = 3.4$). The variances are known to be 1 for each group. 95% CI for $\mu_1-\mu_2$?
$1.1 \pm 1.96\sqrt{\frac{1}{12}+\frac{1}{15}} = 1.1 \pm 1.96(.39) = (.3, 1.9)$
The CI does not include 0
• Samples from non-normal populations → central limit theorem
<Example 5.4.2>
To compare the socio-economic status (SES) of patients from two hospitals. 75 patients from hospital A: $\bar{x}_1 = 6{,}800$; 80 patients from hospital B: $\bar{x}_2 = 4{,}450$; the population standard deviations are $\sigma_1 = 600$, $\sigma_2 = 500$.
99% CI of $\mu_1-\mu_2$?
$(6800-4450) \pm 2.58\sqrt{\frac{(600)^2}{75}+\frac{(500)^2}{80}} = (2120, 2580)$
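Both known-variance examples for the difference of two means use the same formula, so they can be checked with one small function. This is a sketch; `z_ci_diff` is a hypothetical helper name, and `qnorm()` gives 2.576 rather than the rounded table value 2.58, so the 99% limits differ slightly in the last digit.

```r
# CI for mu1 - mu2 with known population standard deviations.
# z_ci_diff is a hypothetical helper name used only for this sketch.
z_ci_diff <- function(x1, x2, sigma1, sigma2, n1, n2, alpha = 0.05) {
  se <- sqrt(sigma1^2 / n1 + sigma2^2 / n2)   # SE of the difference
  z <- qnorm(1 - alpha / 2)
  (x1 - x2) + c(-1, 1) * z * se
}

ci_541 <- z_ci_diff(4.5, 3.4, 1, 1, 12, 15)                       # Example 5.4.1
ci_542 <- z_ci_diff(6800, 4450, 600, 500, 75, 80, alpha = 0.01)   # Example 5.4.2
print(ci_541)
print(ci_542)
```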
• t-distribution and the difference of means:
In practice, the population variances are typically unknown. Two approaches:
1) Equal variances
2) Different variances
1) When the variances are the same:
we compute a pooled estimate as a weighted average of the two sample variances
$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$
$100(1-\alpha)\%$ CI of $\mu_1-\mu_2$: $(\bar{x}_1-\bar{x}_2) \pm t_{(1-\alpha/2)}\sqrt{\frac{s_p^2}{n_1}+\frac{s_p^2}{n_2}}$, df = $n_1+n_2-2$
<Example 5.4.3>
Measuring amylase: 15 normal controls (group 2), sample mean = 96, sd = 35; 22 patients (group 1), sample mean = 120, sd = 40. Populations ∼ normal. Variances unknown but equal.
95% CI of $\mu_1-\mu_2$?
$s_p^2 = \frac{14(35)^2 + 21(40)^2}{15+22-2} = 1450$
$(120-96) \pm 2.0301\sqrt{\frac{1450}{15}+\frac{1450}{22}} = 24 \pm 26 = (-2, 50)$
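The pooled-variance interval can be reproduced from the summary statistics of the amylase example. A minimal sketch follows; `pooled_t_ci` is a hypothetical helper name.

```r
# Pooled-variance t interval for mu1 - mu2 (equal variances assumed).
# pooled_t_ci is a hypothetical helper name used only for this sketch.
pooled_t_ci <- function(x1, s1, n1, x2, s2, n2, alpha = 0.05) {
  df <- n1 + n2 - 2
  sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df   # pooled variance
  se <- sqrt(sp2 / n1 + sp2 / n2)
  tcrit <- qt(1 - alpha / 2, df)
  list(sp2 = sp2, df = df, ci = (x1 - x2) + c(-1, 1) * tcrit * se)
}

# Group 1: 22 patients (mean 120, sd 40); group 2: 15 controls (mean 96, sd 35).
res <- pooled_t_ci(120, 40, 22, 96, 35, 15)
print(res$sp2)   # 1450
print(res$ci)    # about (-1.9, 49.9)
```

With raw data, `t.test(x, y, var.equal = TRUE)` performs the same pooled calculation.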
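2) When the variances are different
$t' = \frac{(\bar{x}_1-\bar{x}_2)-(\mu_1-\mu_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}$ does not follow a t-distribution!
Approximate reliability coefficient: $t'_{(1-\alpha/2)} = \frac{w_1 t_1 + w_2 t_2}{w_1 + w_2}$
* $w_1 = s_1^2/n_1$, $w_2 = s_2^2/n_2$; $t_1 = t_{(1-\alpha/2)}$ with df $= n_1-1$, $t_2 = t_{(1-\alpha/2)}$ with df $= n_2-1$
$100(1-\alpha)\%$ CI of $\mu_1-\mu_2$: $(\bar{x}_1-\bar{x}_2) \pm t'_{(1-\alpha/2)}\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$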
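<Example 5.4.4>
Measuring a biomarker. We can assume normal distributions, but the variances are not equal. Patients (group 1): $n_1 = 10$, $\bar{x}_1 = 62.6$, $s_1 = 33.8$; normal controls (group 2): $n_2 = 20$, $\bar{x}_2 = 47.2$, $s_2 = 10.1$.
95% CI of $\mu_1-\mu_2$?
$w_1 = 33.8^2/10 = 114.244$, $w_2 = 10.1^2/20 = 5.1005$; $t_1 = t_9 = 2.2622$ (patients), $t_2 = t_{19} = 2.0930$ (normal)
$t' = \frac{114.244(2.2622) + 5.1005(2.0930)}{114.244 + 5.1005} = 2.255$
$(62.6-47.2) \pm 2.255\sqrt{\frac{33.8^2}{10}+\frac{10.1^2}{20}} = (-9.2, 40.0)$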
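The weighted reliability coefficient t' for unequal variances can be computed as follows. This is a sketch of the weighted-t calculation shown on the slides; `weighted_t_ci` is a hypothetical helper name. R's built-in `t.test(..., var.equal = FALSE)` uses the Welch degrees-of-freedom approximation instead, so its interval will not match this one exactly.

```r
# CI for mu1 - mu2 with unequal variances, using the weighted coefficient t'.
# weighted_t_ci is a hypothetical helper name used only for this sketch.
weighted_t_ci <- function(x1, s1, n1, x2, s2, n2, alpha = 0.05) {
  w1 <- s1^2 / n1
  w2 <- s2^2 / n2
  t1 <- qt(1 - alpha / 2, n1 - 1)
  t2 <- qt(1 - alpha / 2, n2 - 1)
  tprime <- (w1 * t1 + w2 * t2) / (w1 + w2)   # weighted average of t1 and t2
  list(tprime = tprime,
       ci = (x1 - x2) + c(-1, 1) * tprime * sqrt(w1 + w2))
}

# Patients: n = 10, mean 62.6, sd 33.8; normal: n = 20, mean 47.2, sd 10.1.
res <- weighted_t_ci(62.6, 33.8, 10, 47.2, 10.1, 20)
print(res$tprime)   # about 2.255
print(res$ci)       # about (-9.2, 40.0)
```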
Homework
• 5.2.4 5.2.5
• 5.3.4 5.3.5
• 5.4.3 5.4.10
5.5 CI of a population proportion
• $100(1-\alpha)\%$ CI of $p$: $\hat{p} \pm z_{(1-\alpha/2)}\sqrt{\hat{p}(1-\hat{p})/n}$
<Example 5.5.1>
Oral hygiene behavior: 123 out of 300 people have two oral exams per year.
95% CI of $p$?
$\hat{p} = 123/300 = 0.41$
$0.41 \pm 1.96\sqrt{0.41(0.59)/300} = 0.41 \pm 0.056 = (.35, .47)$
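The normal-approximation interval for a proportion is a one-line formula in R. The sketch below uses a hypothetical helper name, `prop_ci`, with the counts from Example 5.5.1.

```r
# Normal-approximation (Wald) CI for a single proportion.
# prop_ci is a hypothetical helper name used only for this sketch.
prop_ci <- function(x, n, alpha = 0.05) {
  p <- x / n
  se <- sqrt(p * (1 - p) / n)
  p + c(-1, 1) * qnorm(1 - alpha / 2) * se
}

ci <- prop_ci(123, 300)   # Example 5.5.1
print(ci)                 # about (0.354, 0.466) before rounding
```

`prop.test()` and `binom.test()` in base R report intervals based on other methods (Wilson score and exact binomial), so their limits differ somewhat from this formula.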
5.6 CI of the difference of two population proportions
• $100(1-\alpha)\%$ CI of $p_1-p_2$: $(\hat{p}_1-\hat{p}_2) \pm z_{(1-\alpha/2)}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
<Example 5.6.1>
Recovery times from a disease under two treatments. 200 patients are randomly assigned to two treatment groups of 100 each. Treatment A: 78 patients recovered within 3 days; treatment B: 90 patients.
95% CI of $p_1-p_2$?
$(.78-.90) \pm 1.96\sqrt{\frac{(.78)(.22)}{100}+\frac{(.90)(.10)}{100}} = (-.22, -.02)$
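The two-proportion interval can be checked the same way. This is a sketch with a hypothetical helper name, `prop_diff_ci`; it assumes 100 patients per treatment group, as the divisors in the slide's calculation indicate.

```r
# Normal-approximation CI for p1 - p2.
# prop_diff_ci is a hypothetical helper name used only for this sketch.
prop_diff_ci <- function(x1, n1, x2, n2, alpha = 0.05) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  (p1 - p2) + c(-1, 1) * qnorm(1 - alpha / 2) * se
}

ci <- prop_diff_ci(78, 100, 90, 100)   # treatment A vs. treatment B
print(round(ci, 2))                    # -0.22 -0.02
```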
5.7 Sample size calculation: inference of the mean
• $d$ = (reliability coef) × (SE) = (width of the CI)/2
• Sampling from an infinite population: $d = z\frac{\sigma}{\sqrt{n}}$, so $n = \frac{z^2\sigma^2}{d^2}$
• Sampling without replacement from a small population: $d = z\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$, so $n = \frac{N z^2\sigma^2}{d^2(N-1) + z^2\sigma^2}$
<Example 5.7.1>
Measure the daily protein intake of teenage girls. Width of the CI = 10 (±5), confidence level = 0.95, population sd = 20. The population is very large, so we can ignore the finite population correction factor.
$z = 1.96$, $\sigma = 20$, $d = 5$
$n = \frac{z^2\sigma^2}{d^2} = \frac{(1.96)^2(20)^2}{(5)^2} = 61.47 \approx 62$ girls
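Both sample-size formulas for the mean fit in one small function. This is a sketch; `n_for_mean` is a hypothetical helper name, and the finite population size N = 1000 in the second call is an illustrative assumption, not a value from the slides.

```r
# Sample size needed to estimate a mean within +/- d.
# n_for_mean is a hypothetical helper name; N = Inf means no finite
# population correction.
n_for_mean <- function(sigma, d, conf = 0.95, N = Inf) {
  z <- qnorm(1 - (1 - conf) / 2)
  if (is.infinite(N)) {
    n <- z^2 * sigma^2 / d^2
  } else {
    n <- N * z^2 * sigma^2 / (d^2 * (N - 1) + z^2 * sigma^2)
  }
  ceiling(n)   # always round up to the next whole subject
}

n_inf <- n_for_mean(sigma = 20, d = 5)             # Example 5.7.1
n_fin <- n_for_mean(sigma = 20, d = 5, N = 1000)   # assumed N, for illustration
print(c(n_inf, n_fin))
```

The finite-population version never asks for more subjects than the infinite-population formula.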
5.8 Sample size calculation: inference of the proportion
• $n = \frac{z^2 pq}{d^2}$ (when $n/N \le 0.05$)
• $n = \frac{N z^2 pq}{d^2(N-1) + z^2 pq}$ (with the finite population correction)
<Example 5.8.1>
Surveying the proportion of households with medical service. We know p < 0.35, so we use p = 0.35. d for the 95% CI = 0.05, n = ?
$n = \frac{(1.96)^2(.35)(.65)}{(.05)^2} = 349.6 \approx 350$ households
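The sample size for a proportion follows the same pattern. A minimal sketch with a hypothetical helper name, `n_for_prop`; the call with p = 0.5 is an added illustration of the most conservative choice when nothing is known about p.

```r
# Sample size needed to estimate a proportion within +/- d.
# n_for_prop is a hypothetical helper name; N = Inf means no finite
# population correction.
n_for_prop <- function(p, d, conf = 0.95, N = Inf) {
  z <- qnorm(1 - (1 - conf) / 2)
  pq <- p * (1 - p)
  n <- if (is.infinite(N)) {
    z^2 * pq / d^2
  } else {
    N * z^2 * pq / (d^2 * (N - 1) + z^2 * pq)
  }
  ceiling(n)
}

n_035 <- n_for_prop(p = 0.35, d = 0.05)   # Example 5.8.1
n_050 <- n_for_prop(p = 0.50, d = 0.05)   # pq is largest at p = 0.5
print(c(n_035, n_050))
```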
Homework
• 5.5.3 5.6.2 5.7.3 5.8.4
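5.9 CI of the variance of a normally distributed population
• Point estimator of variance
Good estimator? 'unbiasedness': $E(s^2) = \sigma^2$?
• Pop = (6, 8, 10, 12, 14), n = 2 (example from chap. 4)
$\sigma^2 = \frac{\sum (x_i-\mu)^2}{N} = 8$, $S^2 = \frac{\sum (x_i-\mu)^2}{N-1} = 10$
with replacement: $E(s^2) = \frac{\sum s_i^2}{N^n} = \frac{0 + 2 + \cdots + 2 + 0}{25} = 8 = \sigma^2$
without replacement: $E(s^2) = \frac{\sum s_i^2}{\binom{N}{n}} = \frac{2 + 8 + \cdots + 2}{10} = 10 = S^2$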
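Because the population has only five values, the two expectations can be verified by listing every possible sample of size 2. The sketch below uses only base R (`expand.grid()` and `combn()`).

```r
# Enumerate all samples of size 2 from the small population.
pop <- c(6, 8, 10, 12, 14)

# With replacement: all 5^2 = 25 ordered samples.
wr <- expand.grid(x1 = pop, x2 = pop)
s2_wr <- apply(wr, 1, var)

# Without replacement: all choose(5, 2) = 10 unordered samples.
s2_wor <- apply(combn(pop, 2), 2, var)

sigma2 <- mean((pop - mean(pop))^2)   # divisor N
S2 <- var(pop)                        # divisor N - 1

print(c(mean(s2_wr), sigma2))   # both 8
print(c(mean(s2_wor), S2))      # both 10
```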
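• If N is large, $N-1 \approx N$, so $E(s^2) = S^2 \approx \sigma^2$
The CI of $\sigma^2$ depends on the distribution of $(n-1)s^2/\sigma^2$:
$\chi^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2/\sigma^2 = \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$ (chi-square distribution, Table F)
$100(1-\alpha)\%$ CI of $\sigma^2$?
$\chi^2_{\alpha/2} < \frac{(n-1)s^2}{\sigma^2} < \chi^2_{(1-\alpha/2)}$
$100(1-\alpha)\%$ CI of $\sigma^2$: $\frac{(n-1)s^2}{\chi^2_{(1-\alpha/2)}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{\alpha/2}}$
$100(1-\alpha)\%$ CI of $\sigma$: $\sqrt{\frac{(n-1)s^2}{\chi^2_{(1-\alpha/2)}}} < \sigma < \sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2}}}$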
<Example 5.9.1>
n = 15, measure amylase, $\bar{x} = 96$, $s = 35$, 95% CI of $\sigma^2$?
$s^2 = 1225$, df $= n - 1 = 14$
$\frac{(14)(1225)}{26.119} < \sigma^2 < \frac{(14)(1225)}{5.629}$
$656.6101 < \sigma^2 < 3046.7223$
$25.62 < \sigma < 55.20$
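The chi-square interval for the variance can be computed with `qchisq()`. This is a sketch with a hypothetical helper name, `var_ci`; because `qchisq()` returns more decimals than the printed table values 26.119 and 5.629, the upper limit differs slightly from the hand calculation.

```r
# Chi-square CI for a normal population variance and standard deviation.
# var_ci is a hypothetical helper name used only for this sketch.
var_ci <- function(s, n, alpha = 0.05) {
  df <- n - 1
  lower <- df * s^2 / qchisq(1 - alpha / 2, df)   # divide by the upper quantile
  upper <- df * s^2 / qchisq(alpha / 2, df)       # divide by the lower quantile
  list(variance = c(lower, upper), sd = sqrt(c(lower, upper)))
}

res <- var_ci(s = 35, n = 15)   # Example 5.9.1
print(res$variance)             # about (656.6, 3046.9)
print(res$sd)                   # about (25.62, 55.20)
```

The interval is not symmetric about $s^2 = 1225$, since the chi-square distribution is skewed.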
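5.10 CI for the ratio of the variances of two normally distributed populations
• $F = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} \sim F_{n_1-1,\,n_2-1}$ (Table G)
$100(1-\alpha)\%$ CI of $\sigma_1^2/\sigma_2^2$?
$F_{\alpha/2} < \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} < F_{(1-\alpha/2)}$, so $\frac{s_1^2/s_2^2}{F_{(1-\alpha/2)}} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{s_1^2/s_2^2}{F_{\alpha/2}}$
<Example 5.10.1>
Normal adults, n = 21 (group 1); Parkinson's disease patients, n = 16 (group 2). Observe the response time to a certain stimulus. Sample variance of group 1 = 1600, group 2 = 1225.
95% CI of $\sigma_1^2/\sigma_2^2$?
$\frac{1600/1225}{F_{.975}} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{1600/1225}{F_{.025}}$
$\frac{1600/1225}{2.76} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{1600/1225}{.389}$
$.473 < \frac{\sigma_1^2}{\sigma_2^2} < 3.36$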
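The F-based interval for the variance ratio can be computed with `qf()`. A minimal sketch follows; `var_ratio_ci` is a hypothetical helper name, and the inputs are the sample variances and group sizes of Example 5.10.1.

```r
# F-based CI for the ratio of two normal population variances.
# var_ratio_ci is a hypothetical helper name used only for this sketch.
var_ratio_ci <- function(s1sq, n1, s2sq, n2, alpha = 0.05) {
  ratio <- s1sq / s2sq
  f_hi <- qf(1 - alpha / 2, n1 - 1, n2 - 1)   # F_(1 - alpha/2)
  f_lo <- qf(alpha / 2, n1 - 1, n2 - 1)       # F_(alpha/2)
  c(lower = ratio / f_hi, upper = ratio / f_lo)
}

ci <- var_ratio_ci(1600, 21, 1225, 16)   # Example 5.10.1
print(round(ci, 3))
```

With raw data, `var.test(x, y)` reports the same kind of interval.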
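• Statistical distributions
$Y_1, Y_2, \ldots, Y_n$ : random sample, iid from $N(\mu, \sigma^2)$
$\bar{Y} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$
$\frac{\bar{Y}-\mu}{\sigma/\sqrt{n}} \sim N(0, 1)$
$\frac{\bar{Y}-\mu}{s/\sqrt{n}} \sim t_{n-1}$
$\chi^2_n$ : sum of squares of n independent standard normal rv's
$\sum_{i=1}^{n}\left(\frac{Y_i-\mu}{\sigma}\right)^2 \sim \chi^2_n$, since $\frac{Y_i-\mu}{\sigma} \sim N(0,1)$
$\frac{(n-1)s^2}{\sigma^2} = \sum_{i=1}^{n}\frac{(Y_i-\bar{Y})^2}{\sigma^2} \sim \chi^2_{n-1}$
$P\left(\chi^2_{\alpha/2,\,n-1} \le \frac{(n-1)s^2}{\sigma^2} \le \chi^2_{1-\alpha/2,\,n-1}\right) = 1-\alpha$
$F_{n_1,n_2}$ : ratio of 2 independent chi-squares (df = $n_1, n_2$), each divided by its df
$F_{n_1,n_2} = \frac{\chi^2_{n_1}/n_1}{\chi^2_{n_2}/n_2}$
$\frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} \sim F_{n_1-1,\,n_2-1}$
$P\left(F_{\alpha/2} \le \frac{s_1^2\,\sigma_2^2}{s_2^2\,\sigma_1^2} \le F_{1-\alpha/2}\right) = 1-\alpha$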
Homework
• 5.9.3 5.9.7 5.10.5 5.10.7
• Review exercises 13 21 23