CENTRE FOR INNOVATION, RESEARCH AND COMPETENCE IN THE LEARNING ECONOMY
Session 2: Basic techniques for innovation data analysis.
Part I: Statistical inferences and comparisons of groups
Taehyun Jung [email protected]
CIRCLE, Lund University
10:30-12:00 December 10 2012
For Survey of Quantitative Research, NORSI
Objectives of this session
Contents
– Correlation
– Statistical Inference and Hypothesis Testing
– t-Test
– Confidence Interval
– Chi-square Statistic
Correlation
Scatterplot
A scatterplot displays the strength, direction, and form of the relationship between two quantitative variables.
– As in any graph of data, look for the overall pattern and for striking departures from that pattern.
– Form: linear, curved, clusters, no pattern
– Direction: positive, negative, no direction
– Strength: how closely the points fit the "form"
– An important kind of departure is an outlier, an individual value that falls outside the overall pattern of the relationship.
[Figure: example scatterplots showing a linear relationship, a nonlinear relationship, and no relationship]
With a strong relationship, you can get a pretty good estimate of y if you know x.
With a weak relationship, for any x you might get a wide range of y values.
The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.
The correlation coefficient r
The sample Pearson correlation coefficient r measures the strength of the linear relationship between two quantitative variables.
– r is always a number between −1 and 1.
– r > 0 indicates a positive association.
– r < 0 indicates a negative association.
– Values of r near 0 indicate a very weak linear relationship.
– The strength of the linear relationship increases as r moves away from 0 toward −1 or 1.
– The extreme values r = −1 and r = 1 occur only in the case of a perfect linear relationship.
– Part of the calculation involves finding z, the standardized score of each observation: $r = \frac{1}{n-1}\sum_i \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$. This allows us to compare correlations between data sets where the variables are measured in different units or where the variables are different.
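In Stata, for instance, the sample correlation can be obtained as follows; a minimal sketch, where x and y are hypothetical variable names:

correlate x y                    // correlation matrix with r
pwcorr x y, sig obs              // pairwise r with significance level and number of observations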
Facts About Correlation
– Correlation makes no distinction between explanatory and response variables.
– r has no units and does not change when we change the units of measurement of x, y, or both.
– Positive r indicates positive association between the variables, and negative r indicates negative association.
– The correlation r is always a number between −1 and 1.
Cautions:
– Correlation requires that both variables be quantitative.
– Correlation does not describe curved relationships between variables, no matter how strong the relationship is.
– Correlation is not resistant: r is strongly affected by a few outlying observations.
– Correlation is not a complete summary of two-variable data.
Strength: How closely the points follow a straight line.
Direction is positive when individuals with higher x values tend to have higher values of y
“r” ranges from −1 to +1
Influential points
– Correlations are calculated using means and standard deviations and thus are NOT resistant to outliers.
– Just moving one point away from the general trend in the example plot decreases the correlation from −0.91 to −0.75.
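A minimal Stata sketch of this effect using simulated (hypothetical) data: dragging a single observation off the trend is enough to weaken r noticeably.

* sketch with simulated data: one influential point weakens a strong correlation
clear
set seed 1
set obs 30
gen x = rnormal()
gen y = -0.9*x + rnormal(0, 0.4)
correlate x y                    // r is strongly negative
replace y = y + 4 in 1           // drag a single observation away from the trend
correlate x y                    // r moves noticeably toward zero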
Statistical Inference and Hypothesis Testing
Normal distribution
The normal distribution has the bell-shaped (Gaussian) form.
– It arises from the central limit theorem, which states that under mild conditions the mean of a large number of random variables independently drawn from the same distribution is distributed approximately normally, irrespective of the form of the original distribution.
– It is very tractable analytically.
$$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X-\mu)^2}{2\sigma^2}}, \qquad X \sim N(\mu, \sigma^2)$$
[Figure: bell-shaped density f(X) centred at μ, with tick marks at μ ± σ, μ ± 2σ, μ ± 3σ, and μ ± 4σ]
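As an illustration of the central limit theorem, the following Stata sketch (simulated data, hypothetical variable names) draws 10,000 means of 30 uniform(0,1) variables; their distribution is close to normal even though the underlying distribution is far from it.

clear
set seed 12345
set obs 10000
forvalues i = 1/30 {
    gen u`i' = runiform()
}
egen xbar = rowmean(u1-u30)      // mean of 30 uniform(0,1) draws per observation
summarize xbar                   // mean ≈ 0.5, s.d. ≈ sqrt(1/12)/sqrt(30) ≈ 0.053
histogram xbar, normal           // histogram with a normal density overlay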
Testing a hypothesis relating to the population mean
We will suppose that we have observations on a random variable X with a normal distribution with unknown mean μ, and that we wish to test the hypothesis that the mean is equal to some specific value μ0.
– Null hypothesis: H0: μ = μ0
– Alternative hypothesis: H1: μ ≠ μ0
Suppose that we have a sample of data for the example model and the sample mean is X̄. Would this be evidence against the null hypothesis H0: μ = μ0?
– No, it is not. It is lower than μ0, but we would not expect X̄ to be exactly equal to μ0, because the sample mean has a random component.
– If the null hypothesis is true, the probability of the sample mean being one standard deviation or more above or below the population mean is 31.7%.
What if the sample mean is four standard deviations above the hypothetical mean?
– The chance of getting such an extreme estimate is only 0.006%.
– We would reject the null hypothesis.
The usual procedure for making decisions is to reject the null hypothesis if it implies that the probability of getting such an extreme sample mean is less than some (small) probability p.
– For example, reject if the probability of getting such an extreme sample mean is less than 0.05 (5%).
– The 2.5% tails of a normal distribution always begin 1.96 standard deviations from its mean.
Decision rule (5% significance level): reject H0: μ = μ0
– (1) if X̄ > μ0 + 1.96 s.d. or (2) if X̄ < μ0 − 1.96 s.d.
– equivalently, (1) if z > 1.96 or (2) if z < −1.96
• Type I error: rejection of H0 when it is in fact true.
• Probability of Type I error: in this case, 5%
• Significance level (size) of the test is 5%.
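The tail probabilities and the 1.96 cut-off quoted above can be checked directly in Stata, for example:

display 2*(1 - normal(1))        // prob. of being 1 s.d. or more from the mean ≈ 0.317
display 2*(1 - normal(4))        // prob. of being 4 s.d. or more from the mean ≈ 0.00006
display invnormal(0.975)         // the 2.5% upper tail starts 1.96 s.d. above the mean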
We can of course reduce the risk of making a Type I error by reducing the size of the rejection region.
t-Test
What if we do not know the standard deviation? The test statistic then has a t distribution instead of a normal distribution.

s.d. of X known:
– discrepancy between hypothetical value and sample estimate, in terms of s.d.: $z = \dfrac{\bar{X} - \mu_0}{\text{s.d.}}$
– 5% significance test: reject H0: μ = μ0 if z > 1.96 or z < −1.96

s.d. of X not known:
– discrepancy between hypothetical value and sample estimate, in terms of the standard error (s.e.): $t = \dfrac{\bar{X} - \mu_0}{\text{s.e.}}$
– 5% significance test: reject H0: μ = μ0 if t > tcrit or t < −tcrit
For a sample of size n, the sample standard deviation s is:
$$s = \sqrt{\frac{1}{n-1}\sum_i (x_i - \bar{x})^2}$$
– n − 1 is the "degrees of freedom."
– The value s/√n is called the standard error of the mean (SEM).
– Scientists often present their sample results as the mean ± SEM.
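As a quick illustration (x is a hypothetical variable), both quantities can be recovered from the stored results after summarize in Stata:

summarize x
display r(sd)                    // sample standard deviation s
display r(sd)/sqrt(r(N))         // standard error of the mean, s/sqrt(n)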
When the number of degrees of freedom is large, the t distribution looks very much like a normal distribution (and as the number increases, it converges on one).
Then why use the t distribution?
– Although the distributions are generally quite similar, the t distribution has longer tails than the normal distribution, the difference being greater the smaller the number of degrees of freedom.
– As a consequence, the rejection regions have to start more standard deviations away from zero for a t distribution than for a normal distribution.
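For instance, the 2.5% critical values of t in Stata shrink toward the normal 1.96 as the degrees of freedom grow:

display invnormal(0.975)        // normal: 1.96
display invttail(5, 0.025)      // t with 5 df: about 2.57
display invttail(19, 0.025)     // t with 19 df: about 2.09
display invttail(120, 0.025)    // t with 120 df: about 1.98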
Example
A certain city abolishes its local sales tax on consumer expenditure. A survey of 20 households shows that, in the following month, mean household expenditure increased by $160 and the standard error of the increase was $60.
We wish to determine whether the abolition of the tax had a significant effect on household expenditure.
– We take as our null hypothesis that there was no effect: H0: μ = 0.
– The test statistic is
$$t = \frac{160 - 0}{60} = 2.67$$
– The critical values of t with 19 degrees of freedom are 2.09 at the 5 percent significance level and 2.86 at the 1 percent level.
– Hence we reject the null hypothesis of no effect at the 5 percent level but not at the 1 percent level.
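A sketch of the same test using Stata's immediate-form command (the sample s.d. here is recovered from the reported standard error as 60 × √20 ≈ 268.3, an assumption of this illustration):

ttesti 20 160 268.3 0            // n, mean, s.d., hypothesized mean; t ≈ 2.67
display invttail(19, 0.025)      // 5% critical value ≈ 2.09
display invttail(19, 0.005)      // 1% critical value ≈ 2.86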
Robustness
The t tests are exactly correct when the population is distributed exactly normally. However, most real data are not exactly normal.
The t tests are robust to small deviations from normality: the results will not be affected too much. Factors that do strongly matter are:
– Random sampling. The sample must be an SRS from the population.
– Outliers and skewness. They strongly influence the mean and therefore the t procedures. However, their impact diminishes as the sample size gets larger because of the Central Limit Theorem.
– Specifically: when n < 15, the data must be close to normal and without outliers; when 15 < n < 40, mild skewness is acceptable, but not outliers; when n > 40, the t statistic will be valid even with strong skewness.
Confidence interval
Any hypothesis lying in the interval from μmin to μmax would be compatible with the sample estimate (i.e., would not be rejected by it). We call this interval the 95% confidence interval.
[Figure: number line showing the sample estimate X̄ together with μmin and μmax, and points one standard deviation and 1.96 standard deviations on either side of them]
Standard deviation known:
– 95% confidence interval: X̄ − 1.96 s.d. < μ < X̄ + 1.96 s.d.
– 99% confidence interval: X̄ − 2.58 s.d. < μ < X̄ + 2.58 s.d.
Standard deviation estimated by the standard error:
– 95% confidence interval: X̄ − tcrit (5%) × s.e. < μ < X̄ + tcrit (5%) × s.e.
– 99% confidence interval: X̄ − tcrit (1%) × s.e. < μ < X̄ + tcrit (1%) × s.e.
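Continuing the sales-tax example (mean increase 160, s.e. 60, 19 degrees of freedom), the limits can be computed directly in Stata; a sketch:

display 160 - invttail(19, 0.025)*60   // lower 95% limit ≈ 34
display 160 + invttail(19, 0.025)*60   // upper 95% limit ≈ 286
display 160 - invttail(19, 0.005)*60   // lower 99% limit ≈ -12
display 160 + invttail(19, 0.005)*60   // upper 99% limit ≈ 332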
Chi-square statistic
STATA command:
. tab dused_ndef largef, col chi
Pearson chi2(1) = 11.4978   Pr = 0.001
Can we conclude that large firms use patents more strategically than small firms based on this table?
Use of patents by firm size
                          Small Firm   Large Firm   Column total
Non-strategic use   #         160        1,113          1,273
                    % column 91.95        81.66          82.82
Strategic use       #          14          250            264
                    % column  8.05        18.34          17.18
Total               #         174        1,363          1,537
                    % column 100.00      100.00         100.00
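The same test can be reproduced from the cell counts alone with Stata's immediate-form command; a sketch:

tabi 160 1113 \ 14 250, chi2 col       // Pearson chi2(1) = 11.4978, Pr = 0.001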
Chi-square hypothesis test
The chi-square statistic (χ2) measures how far the sample is from what we "expect" to see in a random sample from a population with NO relationship.
– If χ2 is too far from what we expected, we conclude that the sample did not come from a population with no relationship and therefore conclude that the variables must be related in the population.
– H0: There is no relationship between categorical variable A and categorical variable B.
– Ha: There is some relationship between categorical variable A and categorical variable B.
This alternative hypothesis is not really one-sided (> or <) or two-sided (≠). It can be called "many-sided" because it allows any kind of relationship between variables A and B to count.
We want to test the hypothesis that there is no relationship between these two categorical variables (H0).
– To test this hypothesis, we compare actual counts from the sample data with expected counts given the null hypothesis of no relationship.
– The expected count in any cell of a two-way table when H0 is true is:
expected count = (row total × column total) / table total
The chi-square statistic (χ2) is a measure of how much the observed cell counts in a two-way table diverge from the expected cell counts:
$$\chi^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$
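For the patent table above, a sketch of the calculation by hand in Stata (expected counts rounded to one decimal for readability):

display 174*1273/1537            // expected count for "non-strategic use, small firm" ≈ 144.1
display (160-144.1)^2/144.1 + (1113-1128.9)^2/1128.9 + (14-29.9)^2/29.9 + (250-234.1)^2/234.1   // ≈ 11.5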
Large values of χ2 represent strong deviations from the expected distribution under H0 and provide evidence against H0.
However, since χ2 is a sum, how large a χ2 is required for statistical significance depends on the number of comparisons made.
For the chi-square test, H0 states that there is no association between the row and column variables in a two-way table. The alternative is that these variables are related.
If H0 is true, the chi-square statistic has approximately a χ2 distribution with (r − 1)(c − 1) degrees of freedom.
The P-value for the chi-square test is the area to the right of the observed X2: P(χ2 ≥ X2).
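For the example above, this tail area can be obtained in Stata as:

display chi2tail(1, 11.4978)     // ≈ 0.0007, reported in the output as Pr = 0.001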
Significance Level (alpha)
The significance level is the probability of rejecting the null hypothesis if H0 is true.
– Typically, a .05 or .01 significance level is used.
– With a significance level of .05 and 1 df, the critical value is X2 = 3.84; we will reject H0 when X2* is greater than 3.84 and accept H0 when X2* is less than 3.84.
– If the null hypothesis is true (if the variables are not related in the population), we will still (incorrectly) reject H0 (conclude that the variables are related in the population) about 5 times (or 1 time) in 100 hypothesis tests.
A key step in the hypothesis test is deciding how willing we are to make a Type I error. (We must take some chance of rejecting a true null hypothesis, or we will have no chance of rejecting a false one.)
– Type I error: incorrectly rejecting the null hypothesis.
– Type II error: incorrectly accepting the null hypothesis.
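The 3.84 cut-off quoted above is simply the 5% critical value of χ2 with 1 degree of freedom; in Stata, for example:

display invchi2tail(1, 0.05)     // ≈ 3.84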