Sept. 30 In Chapter 10, try exercises 8, 15, 20 on pages 222–224.
Monday’s assignment: #7, 12 on pp. 222–223
Deterministic relationship:
[Figure: Celsius (vertical axis) versus Fahrenheit (horizontal axis).]
Value of response is exactly determined by explanatory variable. In this case, the relationship happens to be linear, so what is r?
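As a quick check on the question of what r is for a deterministic linear relationship, here is a minimal sketch. The helper `pearson_r` and the sample temperatures (taken from the tick marks of the plot) are illustrative assumptions, not part of the original slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative Fahrenheit values; Celsius is computed exactly from them.
fahrenheit = [20, 30, 40, 50, 60, 70, 80]
celsius = [(f - 32) * 5 / 9 for f in fahrenheit]

r = pearson_r(fahrenheit, celsius)
print(round(r, 6))
```

Because every point falls exactly on an increasing line, the computed r equals 1 up to floating-point rounding.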
Statistical relationship:
Value of response only partly determined by explanatory variable
Note: Unlike for a bar graph, a scatterplot is usually not misleading when the axis doesn’t begin at zero.
[Figure: scatterplot of Weight versus Age.]
Another statistical relationship (STAT 100, section 004, Fall 2013)
[Figure: scatterplot of Grade Point Average versus Fastest Speed Ever Driven; the GPA axis extends to 35.]
Another statistical relationship (Outlier removed)
Regression equation: GPA = 3.35 − 0.0031×Speed
[Figure: scatterplot of Grade Point Average (2.0 to 4.0) versus Fastest Speed Ever Driven (0 to 200).]
How do we interpret: 3.35? –0.0031?
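One way to approach the interpretation question is to plug a few speeds into the regression equation. The sketch below does only that; the function name `predicted_gpa` and the chosen speeds are illustrative assumptions.

```python
INTERCEPT = 3.35   # predicted GPA when Speed = 0
SLOPE = -0.0031    # change in predicted GPA for each 1-unit increase in Speed

def predicted_gpa(speed):
    """Predicted GPA from the regression equation GPA = 3.35 - 0.0031 * Speed."""
    return INTERCEPT + SLOPE * speed

for speed in (0, 100, 101):
    print(speed, round(predicted_gpa(speed), 4))
```

The intercept is the prediction at a speed of zero, which lies at the very edge of the plotted range and may not be meaningful. The slope says that each additional unit of speed lowers the predicted GPA by 0.0031.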
Correlation (r): Strength of linear association
Weight versus Age: r = 0.877
Grade Point Average versus Fastest Speed Ever Driven: r = −0.149
[Figures: scatterplot of Weight versus Age; scatterplot of Grade Point Average versus Fastest Speed Ever Driven.]
What if we change units? Say we measured weight in kg and speed in KPH
Weight versus Age: r = 0.877; new r (using kg) = 0.877
Grade Point Average versus Fastest Speed Ever Driven: r = −0.149; new r (using KPH) = −0.149
[Figures: scatterplot of Weight versus Age; scatterplot of Grade Point Average versus Fastest Speed Ever Driven.]
Facts about Correlation (review):
• We use the letter “r” to denote the correlation coefficient.
• The correlation coefficient is a measure of the strength of the linear relationship between the two variables in a scatterplot.
• The value of r must always be between −1 and 1:
  a. r = 0 means no linear relationship.
  b. Positive r means the two variables tend to increase together (with r = 1 meaning a perfect linear relationship).
  c. Negative r means that one variable tends to decrease as the other increases (with r = −1 meaning a perfect linear relationship).
One more: r is unitless, and switching units does not change r.
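The unit-free property can be checked numerically. The sketch below uses made-up ages and weights (they are not the class data) and a hand-written `pearson_r` helper; both are assumptions for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for this example.
ages = [18, 20, 23, 27, 31, 36]
weights_lb = [120, 150, 135, 170, 160, 190]

# Convert pounds to kilograms (1 lb = 0.45359237 kg).
weights_kg = [w * 0.45359237 for w in weights_lb]

r_lb = pearson_r(ages, weights_lb)
r_kg = pearson_r(ages, weights_kg)
print(abs(r_lb - r_kg) < 1e-12)
```

Multiplying a variable by a positive constant rescales the numerator and the denominator of r by the same factor, so the ratio is unchanged.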
[Figure: four example scatterplots of Y versus X, with Correlation = .10, .37, .97, and .70.]
Strength of relationship is not the same as statistical significance
• Strength of linear relationship is measured by the correlation coefficient, r.
• Statistical significance is measured as follows: Assume that the truth is NO linear relationship. What proportion of randomly generated scatterplots would have a linear relationship at least as strong as the one observed?
ANSWER: p-value
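The "randomly generated scatterplots" idea can be sketched as a permutation test: shuffle the y-values to break any real association, recompute r, and count how often the shuffled |r| is at least as large as the observed one. This is one possible way to approximate a p-value, not necessarily the method used for the slides; the data, the function names, and the number of shuffles are all assumptions.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, num_shuffles=2000, seed=0):
    """Approximate two-sided p-value for 'no linear relationship'."""
    rng = random.Random(seed)          # fixed seed so the result is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    count = 0
    for _ in range(num_shuffles):
        rng.shuffle(shuffled)          # pair each x with a random y
        if abs(pearson_r(xs, shuffled)) >= observed:
            count += 1
    return count / num_shuffles

# Hypothetical data with a strong linear trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 2.9, 4.2, 3.8, 5.5, 6.1, 6.8, 8.3, 8.9, 10.2]

p = permutation_p_value(x, y)
print(p)
```

For data this close to a line, almost no shuffled scatterplot matches the observed strength, so the estimated p-value is very small.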
Strength vs. statistical significance
• Even a weak relationship can be statistically significant (if it is based on a large sample)
• Even a strong relationship can be statistically insignificant (if it is based on a small sample)
Common rule of thumb: If p-value is smaller than 0.05 (five percent), then the result is considered statistically significant.
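To see how sample size drives significance, the sketch below uses the standard t statistic for a correlation, t = r·sqrt(n − 2)/sqrt(1 − r²), and compares it with two-sided 5% critical values from a t table. The slides do not say how their p-values were computed, so treating this as the test is an assumption, and the values of r and n are hypothetical.

```python
import math

def t_statistic(r, n):
    """Test statistic for H0: no linear relationship, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Two-sided 5% critical values taken from a standard t table,
# keyed by degrees of freedom (n - 2).
CRITICAL_T = {2: 4.303, 998: 1.962}

# Hypothetical weak correlation from a large sample.
weak_large = t_statistic(0.10, 1000)
# Hypothetical strong correlation from a tiny sample.
strong_small = t_statistic(0.90, 4)

print(round(weak_large, 3), weak_large > CRITICAL_T[998])
print(round(strong_small, 3), strong_small > CRITICAL_T[2])
```

Under these assumptions, r = 0.10 with n = 1000 clears the 5% threshold, while r = 0.90 with n = 4 does not.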
Correlation (r): Strength of linear association
Weight versus Age: r = 0.877, p-value = 0.123 (Not significant)
Grade Point Average versus Fastest Speed Ever Driven: r = −0.149, p-value = 0.199 (Not significant)
[Figures: scatterplot of Weight versus Age; scatterplot of Grade Point Average versus Fastest Speed Ever Driven.]
Average number of words a child knows at various ages
Imagine a scatterplot of the average number of words a child knows for ages about 1.5 to 6. Relationship is nearly linear and quite strong.
Note the problem of extrapolation here: At age 1, predicted size of vocabulary is –251! Extrapolation means trying to predict beyond the range of the explanatory variable. (Remember the Sept. 12 example of running times?)
[Figure: Regression Plot of Words known versus Age.]
Y = -806 + 555 X
S = 158.602 R-Sq = 97.1 % R-Sq(adj) = 96.6 %
Correlation = .985
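The extrapolation problem is easy to reproduce from the fitted line Y = -806 + 555 X. In this small sketch the function name and the age range used for the warning flag are assumptions; the range comes from the description of the data as covering ages about 1.5 to 6.

```python
def predicted_words(age):
    """Predicted vocabulary size from the fitted line Y = -806 + 555 X."""
    return -806 + 555 * age

# Approximate range of ages in the data (assumed from the description).
MIN_AGE, MAX_AGE = 1.5, 6

for age in (1, 2, 4, 6):
    inside = MIN_AGE <= age <= MAX_AGE
    note = "" if inside else "  <- extrapolation"
    print(age, predicted_words(age), note)
```

At age 1 the line predicts a negative vocabulary, which is impossible; the equation is only trustworthy inside the range of ages used to fit it.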
Another problem: Sometimes we see a strong relationship in absurd examples, where two seemingly unrelated variables have a high correlation. This often signals the presence of a third variable that is highly correlated with the other two. (Confounding)
Vocabulary vs Shoe Size
[Figure: Regression Plot of Words known versus Shoe Size.]
Y = -806 + 555 X
S = 158.602 R-Sq = 97.1 % R-Sq(adj) = 96.6 %
Correlation = .985
How can we have such high correlation between shoe size and vocabulary? (Note: These data were made up.) Easy: Both increase with age and hence age is a confounding (or hidden or lurking) variable. Age is positively correlated with both shoe size and with vocabulary.
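A small simulation can show how a lurking variable produces this kind of correlation. Everything here is invented for illustration: the shoe-size formula, the noise levels, the sample size, and the seed are all assumptions, and only the vocabulary line borrows the coefficients shown earlier.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the run is repeatable

# Age drives both variables; shoe size and vocabulary never affect each other.
ages = [rng.uniform(2, 6) for _ in range(200)]
shoe = [3 + 1.5 * a + rng.gauss(0, 0.5) for a in ages]       # made-up formula
words = [-806 + 555 * a + rng.gauss(0, 150) for a in ages]
r_all = pearson_r(shoe, words)

# Now hold age fixed at 4: only the independent noise remains.
shoe_fixed = [3 + 1.5 * 4 + rng.gauss(0, 0.5) for _ in range(200)]
words_fixed = [-806 + 555 * 4 + rng.gauss(0, 150) for _ in range(200)]
r_fixed = pearson_r(shoe_fixed, words_fixed)

print(round(r_all, 2), round(r_fixed, 2))
```

Across all ages the two variables look strongly related, but among children of the same age the correlation falls to roughly zero, which is the signature of a confounding variable.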
Outliers
Outliers are data that are not compatible with the bulk of the data. They show up in graphical displays as detached or stray points. Sometimes they indicate errors in data input. (This is quite a common occurrence!) Sometimes they are the most important data points.
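To illustrate how much a single data-entry error can matter, the sketch below computes r for a small made-up data set and again after one GPA is mistyped as 30.0 instead of 3.0. The numbers are hypothetical and are not the class survey data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical speeds and GPAs, invented for this example.
speeds = [60, 75, 80, 90, 95, 100, 110, 120]
gpas_clean = [3.6, 3.4, 3.5, 3.2, 3.3, 3.1, 3.0, 2.9]

# Same data, but the GPA of 3.0 was entered as 30.0.
gpas_typo = list(gpas_clean)
gpas_typo[6] = 30.0

r_clean = pearson_r(speeds, gpas_clean)
r_typo = pearson_r(speeds, gpas_typo)
print(round(r_clean, 3), round(r_typo, 3))
```

One stray point is enough to flip the sign of the correlation here, which is why it pays to look at the scatterplot before trusting r.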
PA Election Fraud (case study 23.1, page 508)
Special election to fill state senate seat in 1993.
William Stinson (D) received 19,127 machine-counted votes
Bruce Marks (R) received 19,691 machine votes
Stinson got 1,391 absentee votes
Marks got 366
Stinson wins by 461 votes
Question: Is this an unusual number of absentee votes?
Dem minus Rep vote counts (so positive means D ahead), for absentee versus machine
[Figure: Regression Plot of absentee difference versus machine difference, showing the fitted regression curve and 95% PI.]
absentee = -182.575 + 0.295319 machine - 0.0000285 machine**2
S = 294.363 R-Sq = 62.0 % R-Sq(adj) = 57.8 %
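As a rough way to approach the question, the fitted equation can be evaluated at the 1993 machine difference and compared with the observed absentee difference. The function name is an assumption, and dividing the gap by S is only an informal gauge; it is not the 95% PI from the plot.

```python
def predicted_absentee_diff(machine_diff):
    """Fitted curve: absentee = -182.575 + 0.295319*machine - 0.0000285*machine**2."""
    return -182.575 + 0.295319 * machine_diff - 0.0000285 * machine_diff ** 2

S = 294.363  # reported standard deviation about the fitted curve

# Dem minus Rep differences for the 1993 election.
machine_diff = 19127 - 19691    # Stinson minus Marks, machine votes
absentee_diff = 1391 - 366      # Stinson minus Marks, absentee votes

predicted = predicted_absentee_diff(machine_diff)
gap = absentee_diff - predicted

print(machine_diff, absentee_diff)
print(round(predicted, 1), round(gap, 1), round(gap / S, 2))
```

The observed absentee difference sits well over a thousand votes above what the curve predicts for that machine difference, several multiples of S, which suggests the absentee count was unusual relative to past elections.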