anunoy-maujumder
8/7/2019 Sampling distributions, p-value, significance & confidence
Sampling Distributions, p-value, Significance & Confidence Levels
Data Analysis: Tests of significance based on T, F & Z distributions and the Chi-square test
Sampling & Sampling Distributions
Characteristics of a sample are called statistics.
Characteristics of a population are called parameters.
We try to estimate the population parameter based on the sample statistic.
Conclusions drawn from these estimates are subject to two types of error: Type I and Type II.
Characteristic Symbols
Characteristic | Population | Sample
Size | N | n
Mean | μ | x-bar
Standard Deviation | σ | s
Sampling Distributions: The Concept
Suppose you, a team of 4 students, have been asked to collect a sample of 40 from a city with a population of 1,00,000 using a probability sampling method.
Each student needs to collect a sample of 10 men aged 20-25.
Your objective is to find the mean height of the samples so as to infer the mean height of the population.
For each sample, the mean height and standard deviation are calculated.
The mean heights and standard deviations for the 4 samples are different.
A probability distribution of all the possible means of the samples is called the sampling distribution of the mean.
Examples of Population, Sample, Sample Statistic & Sampling Distributions
Population | Sample | Sample Statistic | Sampling Distribution
Water in a river | 10 gallons of water | Mean no. of parts of impurity per million parts of water | Sampling distribution of the mean
All IPL teams | Group of 5 players | Median height | Sampling distribution of the median
All parts produced by a manufacturing process | 50 parts | Proportion defective | Sampling distribution of the proportion
Standard Error (2)
We would observe a different mean in each sample.
This variability in the sample statistic is due to chance, i.e. the differences are solely due to the elements we happened to choose for the samples.
The standard deviation of the distribution of sample means measures the extent to which we expect the means from the different samples to vary because of this chance error in the sampling process. So it is called the standard error.
The standard error indicates not only the size of the chance error, but also the accuracy we are likely to get if we use a sample statistic to estimate a population parameter.
A distribution of sample means with less spread (a smaller standard error) gives a better estimate of the population parameter.
Sampling & Normal Distribution
Experience of Five Bike Owners with Tyre Life
Owner | C | D | E | F | G
Tyre Life (months) | 3 | 3 | 7 | 9 | 14
The population consists of only five people.
We will take all possible samples of the owners in groups of 3.
Compute the sample means (x-bar) and then compute the mean of the sampling distribution.
Sampling & Normal Distribution
Calculation of Sample Mean Tyre Life, n = 3
The table with the calculation is shown below:
Samples of Three | Sample Data | Sum | Sample Mean
EFG | 7+9+14 | 30 | 10
DFG | 3+9+14 | 26 | 8.6667
DEG | 3+7+14 | 24 | 8
DEF | 3+7+9 | 19 | 6.3333
CFG | 3+9+14 | 26 | 8.6667
CEG | 3+7+14 | 24 | 8
CEF | 3+7+9 | 19 | 6.3333
CDF | 3+3+9 | 15 | 5
CDE | 3+3+7 | 13 | 4.3333
CDG | 3+3+14 | 20 | 6.6667
Total | | | 72
Mean of the sampling distribution = 72/10 = 7.2
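The enumeration of all samples of three can be reproduced with a short script. This is a minimal sketch using only the tyre-life figures given for owners C to G; the function and variable names are illustrative and not part of the original slides.

```python
from itertools import combinations
from statistics import mean

# Tyre life in months for the five owners, as listed on the slide.
tyre_life = {"C": 3, "D": 3, "E": 7, "F": 9, "G": 14}

def sample_means(population, n):
    """Return {sample label: sample mean} for every possible sample of size n."""
    return {
        "".join(owners): mean(population[o] for o in owners)
        for owners in combinations(sorted(population), n)
    }

means_n3 = sample_means(tyre_life, 3)
mean_of_means = mean(means_n3.values())
population_mean = mean(tyre_life.values())

for label, m in means_n3.items():
    print(f"{label}: {m:.4f}")
print("Mean of the sampling distribution:", round(mean_of_means, 4))
print("Population mean:", population_mean)
```

The mean of the ten sample means equals the population mean (7.2 months), which is the property the tyre example is meant to show.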
Sampling & Normal Distribution
[Figure: two probability plots with tyre life in months on the horizontal axis. Left: population distribution. Right: sampling distribution of the mean with n = 3.]
Sampling & Normal Distribution
[Figure: two probability plots with tyre life in months on the horizontal axis. Left: sampling distribution of the mean with n = 2. Right: sampling distribution of the mean with n = 4.]
Sampling & Normal Distribution
If the population size is increased to 40 and we take larger sample sizes of 8 and 20:
Calculate x-bar and s
Plot the distributions
[Figure: two probability plots. Left: sampling distribution of the mean with n = 8. Right: sampling distribution of the mean with n = 20.]
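The effect of sample size on the sampling distribution can be explored by simulation. The sketch below is only illustrative: the slides do not give the 40 tyre-life values, so the population here is randomly generated (an assumption), and random sampling stands in for enumerating every possible sample.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of 40 tyre lives (months); not data from the slides.
population = [rng.randint(1, 24) for _ in range(40)]

def simulated_sample_means(pop, n, draws=5000):
    """Means of `draws` random samples of size n, drawn without replacement."""
    return [mean(rng.sample(pop, n)) for _ in range(draws)]

spread_by_n = {}
for n in (2, 4, 8, 20):
    means = simulated_sample_means(population, n)
    spread_by_n[n] = pstdev(means)
    print(f"n={n:2d}  mean of sample means={mean(means):.2f}  spread={spread_by_n[n]:.2f}")

print("population mean:", round(mean(population), 2))
```

In a run like this the mean of the sample means stays close to the population mean, while the spread of the sample means shrinks as n grows.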
P-value
Consider an experiment where you've measured values in two samples, and the means are different. How sure are you that the population means are different as well? There are two possibilities:
The populations have different means.
The populations have the same mean, and the difference you observed is a coincidence of random sampling.
The P value is a probability, with a value ranging from zero to one.
It is the answer to this question: If the populations really have the same mean overall, what is the probability that random sampling would lead to a difference between sample means as large as (or larger than) the one you observed?
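One way to make this question concrete is a permutation test, which estimates the P value by repeatedly reshuffling the group labels. This is a sketch of one possible approach, not the method used in the slides; the two data sets are invented for illustration.

```python
import random
from statistics import mean

def permutation_p_value(a, b, shuffles=10_000, seed=0):
    """Two-sided P value for a difference in means, estimated by reshuffling labels."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(shuffles):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:  # a difference as large as, or larger than, the observed one
            hits += 1
    return hits / shuffles

# Hypothetical measurements, invented for illustration only.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
group_b = [4.2, 4.8, 4.5, 5.0, 4.1, 4.6]
p_value = permutation_p_value(group_a, group_b)
print("estimated P value:", p_value)
```

Each reshuffle imitates random sampling from populations with the same mean, so the returned fraction answers the question posed above for these data.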
P value
Many people misunderstand what question a P value answers.
If the P value is 0.03, that means that there is a 3% chance of observing a difference as large as you observed even if the two population means are identical.
It is tempting to conclude, therefore, that there is a 97% chance that the difference you observed reflects a real difference between populations and a 3% chance that the difference is due to chance. Wrong.
What you can say is that random sampling from identical populations would lead to a difference smaller than you observed in 97% of experiments and larger than you observed in 3% of experiments.
You have to choose. Would you rather believe in a 3% coincidence? Or that the population means are really different?
Statistical hypothesis testing
The P value is a fraction.
The steps of statistical hypothesis testing are:
Set a threshold P value before you do the experiment. In fact, the threshold value (called alpha) is traditionally almost always set to 0.05.
Define the null hypothesis. If you are comparing two means, the null hypothesis is that the two populations have the same mean.
Do the appropriate statistical test to compute the P value.
Compare the P value to the preset threshold value.
If the P value is less than the threshold, state that you "reject the null hypothesis" and that the difference is "statistically significant".
If the P value is greater than the threshold, state that you "do not reject the null hypothesis" and that the difference is "not statistically significant".
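The steps above can be sketched in code. The example assumes a large-sample Z test for comparing two means, which is one of several possible choices of "appropriate statistical test"; the function names and the summary statistics are hypothetical.

```python
from math import erfc, sqrt

def two_sided_p_from_z(z):
    """Probability of a standard normal value at least as extreme as z (both tails)."""
    return erfc(abs(z) / sqrt(2))

def compare_two_means(mean1, sd1, n1, mean2, sd2, n2, alpha=0.05):
    """Large-sample Z test of the null hypothesis that two population means are equal."""
    standard_error = sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (mean1 - mean2) / standard_error
    p = two_sided_p_from_z(z)
    if p < alpha:
        verdict = "reject the null hypothesis (statistically significant)"
    else:
        verdict = "do not reject the null hypothesis (not statistically significant)"
    return z, p, verdict

# Hypothetical summary statistics, chosen only to exercise the steps.
z, p, verdict = compare_two_means(52.0, 8.0, 64, 49.0, 8.0, 64)
print(f"z = {z:.2f}, P = {p:.4f}: {verdict}")
```

The threshold alpha is fixed as a default argument before any data are seen, mirroring the first step.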
Significance Level
The term significant is seductive, and it is easy to misinterpret it.
A result is said to be statistically significant when the result would be surprising if the populations were really identical. A result is said to be statistically significant when the P value is less than a preset threshold value.
It is easy to read far too much into the word significant because the statistical use of the word has a meaning entirely distinct from its usual meaning. Just because a difference is statistically significant does not mean that it is important or interesting.
And a result that is not statistically significant (in the first experiment) may turn out to be very important.
Significance Level
If a result is statistically significant, there are two possible
explanations:
The populations are identical, so there really is no difference. You
happened to randomly obtain larger values in one group and
smaller values in the other, and the difference was large enough to
generate a P value less than the threshold you set. Finding a
statistically significant result when the populations are identical is
called making a Type I error.
The populations really are different, so your conclusion is correct.
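How often a Type I error occurs can be illustrated by simulating many experiments in which both groups come from the same population. This is a sketch under stated assumptions: a standard normal population with known standard deviation 1, a two-sided Z test, and arbitrary group sizes and trial counts.

```python
import random
from math import erfc, sqrt

def type_i_error_rate(trials=2000, n=30, alpha=0.05, seed=1):
    """Fraction of experiments declared significant when both groups share one population."""
    rng = random.Random(seed)
    standard_error = sqrt(2 / n)  # two groups of size n, population sd assumed to be 1
    false_alarms = 0
    for _ in range(trials):
        mean1 = sum(rng.gauss(0, 1) for _ in range(n)) / n
        mean2 = sum(rng.gauss(0, 1) for _ in range(n)) / n
        z = (mean1 - mean2) / standard_error
        p = erfc(abs(z) / sqrt(2))  # two-sided P value
        if p < alpha:
            false_alarms += 1
    return false_alarms / trials

rate = type_i_error_rate()
print("share of identical-population experiments called significant:", rate)
```

With a threshold of 0.05, the share of such false alarms should come out close to 5%.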
Significance Level
There are also two explanations for a result that is not statistically
significant:
The populations are identical, so there really is no difference. Any
difference you observed in the experiment was a coincidence. Your
conclusion of no significant difference is correct.
The populations really are different, but you missed the difference
due to some combination of small sample size, high variability and
bad luck. The difference in your experiment was not large enough to
be statistically significant. Finding results that are not statistically
significant when the populations are different is called making a Type
II error.
Confidence interval of a mean
Statistical calculations produce two kinds of results that
help you make inferences about the populations from the
samples. You've already learned about P values. The
second kind of result is a confidence interval.
95% confidence interval of a mean
Although the calculation is exact, the mean you calculate
from a sample is only an estimate of the population mean.
How good is the estimate? It depends on how large your
sample is and how much the values differ from one
another.
Confidence interval of a mean
Statistical calculations combine sample size and variability to generate a confidence interval for the population mean.
You can calculate intervals for any desired degree of confidence, but 95% confidence intervals are used most commonly. If you assume that your sample is randomly selected from some population, you can be 95% sure that the confidence interval includes the population mean.
More precisely, if you generate many 95% CIs from many data sets, you expect the CI to include the true population mean in 95% of the cases and not to include the true mean value in the other 5%.
Since you don't know the population mean, you'll never know for sure whether or not your confidence interval contains the true mean.
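The "many data sets" reading of a 95% CI can be checked by simulation. The slides do not state the interval formula, so this sketch assumes the large-sample normal approximation x-bar ± 1.96 s/√n; the population parameters are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev

def ci95(sample):
    """Approximate 95% CI for the mean: x-bar +/- 1.96 * s / sqrt(n) (normal approximation)."""
    half_width = 1.96 * stdev(sample) / sqrt(len(sample))
    centre = mean(sample)
    return centre - half_width, centre + half_width

def coverage(true_mean=50.0, sd=10.0, n=50, data_sets=2000, seed=2):
    """Share of simulated data sets whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(data_sets):
        low, high = ci95([rng.gauss(true_mean, sd) for _ in range(n)])
        if low <= true_mean <= high:
            hits += 1
    return hits / data_sets

print("share of intervals containing the true mean:", coverage())
```

In the simulation the true mean is known, so the coverage can be counted directly; with real data it cannot, which is the point of the last sentence above.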
Why 95%?
There is nothing special about 95%. It is just convention that
confidence intervals are usually calculated for 95% confidence.
In theory, confidence intervals can be computed for any degree
of confidence. If you want more confidence, the intervals will be
wider. If you are willing to accept less confidence, the intervals
will be narrower.
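The trade-off between confidence and width can be seen in the multiplier applied to the standard error. This sketch assumes intervals based on the normal distribution; the function name is illustrative.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """Two-sided standard normal multiplier for a given confidence level (e.g. 0.95)."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

for level in (0.80, 0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: x-bar +/- {z_multiplier(level):.3f} standard errors")
```

A higher confidence level gives a larger multiplier and therefore a wider interval.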
Sampling Distribution of the mean when population is normally distributed
The sampling distribution has a mean equal to the population mean:
$\mu_{\bar{x}} = \mu$
The sampling distribution has a standard deviation (standard error) equal to the population standard deviation divided by the square root of the sample size:
$\sigma_{\bar{x}} = \sigma / \sqrt{n}$
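Both properties can be checked numerically. The sketch below draws many samples from a normal population and compares the mean and standard deviation of the sample means with the population mean and with σ/√n; the population values (mean 170, standard deviation 8) and the sample size are hypothetical.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

def simulate_standard_error(mu=170.0, sigma=8.0, n=16, samples=4000, seed=3):
    """Return (mean, standard deviation) of many simulated sample means."""
    rng = random.Random(seed)
    means = [fmean(rng.gauss(mu, sigma) for _ in range(n)) for _ in range(samples)]
    return fmean(means), pstdev(means)

mu, sigma, n = 170.0, 8.0, 16  # hypothetical normal population and sample size
observed_mean, observed_se = simulate_standard_error(mu, sigma, n)
print("mean of sample means:", round(observed_mean, 2), "expected:", mu)
print("standard error:", round(observed_se, 2), "expected:", sigma / sqrt(n))
```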
Test of Significance or Test of Hypothesis
The theory of hypothesis testing begins with an assumption about a parameter of the population.
The assumption is termed a hypothesis, made on the basis of sample observation.
The validity of the hypothesis is tested by analyzing the sample.
The procedure is called a Test of Significance or Test of Hypothesis.
The conventional approach is to set up two different hypotheses, so constructed that if one hypothesis is accepted, the other is rejected, and vice versa.
The hypotheses are i) Null Hypothesis (Ho) ii) Alternate Hypothesis (Ha).
Example: Ho: μ = 100, Ha: μ ≠ 100
Methods to Test Hypothesis
Z-Test
T-Test
F-Test
Chi-square Test
ANOVA
Z-value
Sample: $z = \frac{x - \bar{x}}{s}$
Population: $z = \frac{x - \mu}{\sigma}$
Z = (difference between observed value and expected value) / standard deviation
The z-value tells us how many standard deviations above or below the mean our data value x is.
Positive z-values are above the mean.
Negative z-values are below the mean.
Z-value
The area under the normal curve between the mean and the Z-value is the probability.
A table of Z-values gives the corresponding area, and therefore the probability, for each Z-value.
For Z = 2.24 the probability is 0.4875, meaning that 48.75% of values lie between the mean and 2.24 standard deviations above it.
[Figure: normal curve with the area between the mean and Z = 2.24 shaded, 0.4875 of area.]
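The table lookup can be reproduced with the standard normal cumulative distribution function. This is a small sketch using Python's standard library; the function name is illustrative.

```python
from statistics import NormalDist

def area_mean_to_z(z):
    """Area under the standard normal curve between the mean (z = 0) and z."""
    return abs(NormalDist().cdf(z) - 0.5)

print(round(area_mean_to_z(2.24), 4))  # 0.4875
```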
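Standard Deviations & Z-Value
Area under the normal curve within a given number of standard deviations of the mean:
Z = 1 (μ - σ to μ + σ): 0.6827 of area
Z = 2 (μ - 2σ to μ + 2σ): 0.9545 of area
Z = 3 (μ - 3σ to μ + 3σ): 0.9973 of area
Z = 1.64 (μ - 1.64σ to μ + 1.64σ): about 0.90 of area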
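The areas on this slide can be recomputed from the standard normal distribution. This is a sketch with an illustrative function name; values printed to four decimals may differ in the last digit from those read off a rounded printed table.

```python
from statistics import NormalDist

def area_within(z):
    """Area under the standard normal curve between -z and +z."""
    standard_normal = NormalDist()
    return standard_normal.cdf(z) - standard_normal.cdf(-z)

for z in (1, 1.64, 2, 3):
    print(f"within +/-{z} standard deviations: {area_within(z):.4f} of area")
```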
P = 0.05 and Z-value
[Figure: normal curve with the acceptance region between Z = -1.96 and Z = +1.96 and a region of rejection in each tail.]
Z-value example
For a sample of females, the mean BMI (body mass index) was 26.20 and the standard deviation was 6.57.
A person with a BMI of 19.2 has a z score of:
$z = \frac{x - \bar{x}}{s} = \frac{19.2 - 26.20}{6.57} = -1.07$
So this person has a BMI 1.07 standard deviations below the mean.
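The same arithmetic can be written as a small function. This is a minimal sketch; the function name is illustrative, and the numbers are the BMI figures from the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

bmi_z = z_score(19.2, 26.20, 6.57)
print(round(bmi_z, 2))  # -1.07
```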