Sept. 30 - Pennsylvania State University · personal.psu.edu/drh20/100/lectures/lecture14Sept30.pdf

Sept. 30 In Chapter 10, try exercises 8, 15, 20 on pages 222–224.

Monday’s assignment: #7, 12 on pp. 222–223

Deterministic relationship:

[Figure: scatterplot of Celsius (vertical axis, −10 to 30) versus Fahrenheit (horizontal axis, 20 to 80)]

Value of response is exactly determined by explanatory variable. In this case, the relationship happens to be linear, so what is r?
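One way to check the answer is to compute r directly. The sketch below is not part of the lecture: it assumes the standard conversion C = (F − 32) × 5/9, takes its Fahrenheit values from the tick marks on the plot, and uses a hand-written helper, `pearson_r`, for the usual correlation formula.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Fahrenheit values taken from the axis ticks; Celsius follows exactly.
fahrenheit = [20, 30, 40, 50, 60, 70, 80]
celsius = [(f - 32) * 5 / 9 for f in fahrenheit]

r = pearson_r(fahrenheit, celsius)
print(round(r, 6))  # 1.0
```

Because every point lies exactly on an increasing line, r comes out as 1 (up to rounding error).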

Statistical relationship:

Value of response only partly determined by explanatory variable

Note: Unlike for a bar graph, a scatterplot is usually not misleading when the axis doesn’t begin at zero.

[Figure: scatterplot of Weight (vertical axis, 60 to 200) versus Age (horizontal axis, 10 to 40)]

Another statistical relationship

STAT 100, section 004, Fall 2013

[Figure: scatterplot of Grade Point Average (vertical axis, 5 to 35) versus Fastest Speed Ever Driven (horizontal axis, 0 to 200), including one outlier]

Another statistical relationship (Outlier removed)

Regression equation: GPA = 3.35 − 0.0031×Speed

[Figure: scatterplot of Grade Point Average (vertical axis, 2.0 to 4.0) versus Fastest Speed Ever Driven (horizontal axis, 0 to 200), outlier removed]

How do we interpret: 3.35? –0.0031?
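A small sketch can make the two numbers concrete. It only evaluates the regression equation from the slide; the function name `predicted_gpa` is made up for illustration, and speed is in whatever units the survey used (the slide does not say).

```python
INTERCEPT = 3.35   # predicted GPA when Speed = 0
SLOPE = -0.0031    # change in predicted GPA per one-unit increase in Speed

def predicted_gpa(speed):
    """Evaluate the slide's regression equation GPA = 3.35 - 0.0031 x Speed."""
    return INTERCEPT + SLOPE * speed

print(predicted_gpa(0))                                  # 3.35
print(round(predicted_gpa(100) - predicted_gpa(0), 4))   # -0.31
```

So 3.35 is the predicted GPA for someone whose fastest speed is 0, and −0.0031 means that each additional unit of speed goes with a predicted GPA that is lower by 0.0031 (about 0.31 lower per 100 units).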

Correlation (r): Strength of linear association

Weight versus Age: r = 0.877

Grade Point Average versus Fastest Speed Ever Driven: r = −0.149

[Figure: the two scatterplots, Weight versus Age and Grade Point Average versus Fastest Speed Ever Driven (outlier removed)]

What if we change units? Say we measured weight in kg and speed in KPH

Weight versus Age: r = 0.877. New r (using kg) = 0.877

Grade Point Average versus Fastest Speed Ever Driven: r = −0.149. New r (using KPH) = −0.149

[Figure: the two scatterplots, Weight versus Age and Grade Point Average versus Fastest Speed Ever Driven]

Facts about Correlation (review):

• We use the letter “r” to denote the correlation coefficient.

• The correlation coefficient is a measure of the strength of the linear relationship between the two variables in a scatterplot.

• The value of r must always be between −1 and 1:
   a. r=0 means no linear relationship.
   b. Positive r means the two variables tend to increase together (with r=1 meaning a perfect linear relationship)
   c. Negative r means that one variable increases while the other decreases (with −1 meaning a perfect linear relationship)

One more: r is unitless, and switching units does not change r.
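The unit-free property is easy to verify numerically. In the sketch below the ages and weights are made-up values, not the class data, and `pearson_r` is a hand-written helper. The only outside fact used is the conversion 1 lb = 0.45359237 kg.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data for illustration only.
ages = [12, 18, 25, 31, 38]
weights_lb = [95, 140, 150, 170, 165]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_lb = pearson_r(ages, weights_lb)
r_kg = pearson_r(ages, weights_kg)
print(round(r_lb, 6) == round(r_kg, 6))  # True
```

Multiplying a variable by a positive constant rescales the numerator and the denominator of r by the same factor, so the ratio is unchanged.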

[Figure: four scatterplots of Y versus X, labeled Correlation = .10, Correlation = .37, Correlation = .97, and Correlation = .70]

Strength of relationship is not the same as statistical significance

• Strength of linear relationship is measured by correlation coefficient, r.

• Statistical significance is measured as follows:

Assume that the truth is NO linear relationship. What proportion of randomly generated scatterplots would have a linear relationship at least as strong as the one observed?

ANSWER: p-value
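The "randomly generated scatterplots" idea can be sketched as a permutation test. Shuffling the y-values breaks any real link with x, and we count how often the shuffled plot has a correlation at least as strong as the observed one. This is one common way to approximate a p-value and not necessarily how the p-values in this lecture were computed. The data, the function names, and the choice of 2000 shuffles are all assumptions made for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, num_shuffles=2000, seed=0):
    """Share of shuffled scatterplots whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(num_shuffles):
        rng.shuffle(shuffled)  # simulate "no linear relationship"
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    return hits / num_shuffles

# Hypothetical data: a fairly strong relationship based on only four points.
xs = [10, 20, 30, 40]
ys = [100, 150, 140, 190]
print(pearson_r(xs, ys), permutation_p_value(xs, ys))
```

With only four points there are just 24 possible orderings, so even a strong-looking r cannot produce a very small p-value.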

Strength vs. statistical significance

• Even a weak relationship can be statistically significant (if it is based on a large sample)

• Even a strong relationship can be statistically insignificant (if it is based on a small sample)

Common rule of thumb: If p-value is smaller than 0.05 (five percent), then the result is considered statistically significant.
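To see why sample size matters, it may help to look at the usual test statistic for a correlation, t = r × sqrt((n − 2) / (1 − r²)), which is compared with a t distribution with n − 2 degrees of freedom. The lecture does not say which test produced its p-values, so treat this as a sketch under that standard assumption. The function name `t_statistic` is made up.

```python
import math

def t_statistic(r, n):
    """Standard t statistic for testing 'no linear relationship' given r and n."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# The same weak correlation, r = 0.2, at three sample sizes.
for n in (10, 100, 1000):
    print(n, round(t_statistic(0.2, n), 2))
```

The correlation is identical in all three cases, but the statistic grows roughly like the square root of n, so a larger sample gives a smaller p-value for the same r.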

Correlation (r): Strength of linear association

Weight versus Age: r = 0.877, p-value = 0.123. Not significant

Grade Point Average versus Fastest Speed Ever Driven: r = −0.149, p-value = 0.199. Not significant

[Figure: the two scatterplots, Weight versus Age and Grade Point Average versus Fastest Speed Ever Driven]

Average number of words a child knows at various ages

Imagine a scatterplot of the average number of words a child knows for ages about 1.5 to 6. Relationship is nearly linear and quite strong.

Note the problem of extrapolation here: At age 1, predicted size of vocabulary is –251! Extrapolation means trying to predict beyond the range of the explanatory variable. (Remember the Sept. 12 example of running times?)

[Figure: Regression Plot of Words known (vertical axis, 0 to 2500) versus Age (horizontal axis, 2 to 6)]

Y = -806 + 555 X

S = 158.602   R-Sq = 97.1 %   R-Sq(adj) = 96.6 %

Correlation = .985
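The extrapolation problem can be reproduced by plugging ages into the fitted line Y = -806 + 555 X. In this sketch the function names are made up, and the range of 1.5 to 6 is taken from the slide's description of the data ("ages about 1.5 to 6").

```python
AGE_MIN, AGE_MAX = 1.5, 6  # approximate range of ages in the data

def predicted_words(age):
    """Evaluate the fitted line Y = -806 + 555 X."""
    return -806 + 555 * age

def is_extrapolation(age):
    """True when the age lies outside the range the line was fitted on."""
    return age < AGE_MIN or age > AGE_MAX

print(predicted_words(4), is_extrapolation(4))  # 1414 False
print(predicted_words(1), is_extrapolation(1))  # -251 True
```

Inside the observed range the prediction is sensible. At age 1 the line is being used outside that range and gives an impossible negative vocabulary.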

Another problem: Sometimes we see a strong relationship in absurd examples. Two seemingly unrelated variables have a high correlation. This often signals the presence of a third variable that is highly correlated with the other two. (Confounding)

Vocabulary vs Shoe Size

[Figure: Regression Plot of Words known (vertical axis, 0 to 2500) versus Shoe Size (horizontal axis, 2 to 6)]

Y = -806 + 555 X

S = 158.602   R-Sq = 97.1 %   R-Sq(adj) = 96.6 %

Correlation = .985

How can we have such high correlation between shoe size and vocabulary? (Note: These data were made up.) Easy: Both increase with age and hence age is a confounding (or hidden or lurking) variable. Age is positively correlated with both shoe size and with vocabulary.
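A small simulation shows how a lurking variable can create this pattern. Everything here is invented for illustration: age is drawn at random between 1.5 and 6, shoe size and vocabulary each depend on age plus independent noise, and neither depends on the other. The noise levels and the shoe-size formula are arbitrary assumptions. The last step uses the standard partial-correlation formula to ask what is left once age is accounted for.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)
ages = [rng.uniform(1.5, 6) for _ in range(200)]
# Hypothetical models: both variables are driven by age, not by each other.
shoe = [2 + a + rng.gauss(0, 0.3) for a in ages]
vocab = [-806 + 555 * a + rng.gauss(0, 150) for a in ages]

r_shoe_vocab = pearson_r(shoe, vocab)
r_shoe_age = pearson_r(shoe, ages)
r_vocab_age = pearson_r(vocab, ages)

# Partial correlation of shoe size and vocabulary, holding age fixed.
partial = (r_shoe_vocab - r_shoe_age * r_vocab_age) / math.sqrt(
    (1 - r_shoe_age ** 2) * (1 - r_vocab_age ** 2)
)
print(round(r_shoe_vocab, 3), round(partial, 3))
```

The raw correlation between shoe size and vocabulary should come out high, while the partial correlation should be close to zero, which is what we expect when age explains the whole relationship.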

Outliers

Outliers are data that are not compatible with the bulk of the data. They show up in graphical displays as detached or stray points. Sometimes they indicate errors in data input. (This is quite a common occurrence!) Sometimes they are the most important data points.

PA Election Fraud (case study 23.1, page 508)

Special election to fill state senate seat in 1993.

William Stinson (D) received 19,127 machine-counted votes
Bruce Marks (R) received 19,691 machine-counted votes
Stinson got 1,391 absentee votes
Marks got 366 absentee votes
Stinson wins by 461 votes

Question: Is this an unusual number of absentee votes?

Dem minus Rep vote counts (so positive means D ahead) For absentee versus machine

[Figure: Regression Plot of absentee (vertical axis, −1000 to 1000) versus machine (horizontal axis, 0 to 7500), showing the Regression curve and 95% PI]

absentee = -182.575 + 0.295319 machine - 0.0000285 machine**2

S = 294.363   R-Sq = 62.0 %   R-Sq(adj) = 57.8 %
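As a rough check on the question, the fitted curve can be evaluated at the 1993 machine-vote difference and compared with the observed absentee difference. The sketch uses only the vote counts and the equation given above, and `predicted_absentee` is a made-up name. Dividing the gap by S is a crude yardstick and is not the 95% prediction interval drawn on the plot.

```python
S = 294.363  # residual standard deviation reported with the regression

def predicted_absentee(machine_diff):
    """Evaluate the fitted quadratic for the absentee (Dem minus Rep) difference."""
    return -182.575 + 0.295319 * machine_diff - 0.0000285 * machine_diff ** 2

machine_diff = 19127 - 19691   # Stinson minus Marks, machine votes: -564
absentee_diff = 1391 - 366     # Stinson minus Marks, absentee votes: 1025

predicted = predicted_absentee(machine_diff)
gap = absentee_diff - predicted

print(round(predicted, 1))   # -358.2
print(round(gap / S, 1))     # 4.7
```

The curve predicts that the Democrat would trail by roughly 358 absentee votes, while he actually led by 1,025. The gap is several times S, which points to an unusual result.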
