Sept. 30 - Pennsylvania State University · personal.psu.edu/drh20/100/lectures/lecture14Sept30.pdf


Sept. 30 In Chapter 10, try exercises 8, 15, 20 on pages 222–224.

Monday’s assignment: #7, 12 on pp. 222–223


Deterministic relationship:

[Figure: plot of Celsius (vertical axis) versus Fahrenheit (horizontal axis)]

In this case, the relationship happens to be linear—so what is r? Value of response is exactly determined by explanatory variable.
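To answer the question about r, here is a small sketch that computes the correlation for a few Fahrenheit readings and their exact Celsius conversions. The helper `pearson_r` and the chosen temperatures are illustrative assumptions, not part of the slides; the conversion rule C = (F − 32) × 5/9 is the standard one.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative temperatures covering the plot's horizontal range.
fahrenheit = [20, 30, 40, 50, 60, 70, 80]
celsius = [(f - 32) * 5 / 9 for f in fahrenheit]  # exact conversion rule

r = pearson_r(fahrenheit, celsius)
print(round(r, 6))  # a perfect increasing line gives r = 1, up to rounding
```

Because every point lies exactly on an increasing line, r comes out as 1.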


Statistical relationship:

Value of response only partly determined by explanatory variable

Note: Unlike for a bar graph, a scatterplot is usually not misleading when the axis doesn’t begin at zero.

[Figure: scatterplot of Weight (vertical axis) versus Age (horizontal axis)]


Another statistical relationship

STAT 100, section 004, Fall 2013

[Figure: scatterplot of Grade Point Average (vertical axis) versus Fastest Speed Ever Driven (horizontal axis)]


Another statistical relationship (Outlier removed)

Regression equation: GPA = 3.35 − 0.0031×Speed

[Figure: scatterplot of Grade Point Average (vertical axis) versus Fastest Speed Ever Driven (horizontal axis), outlier removed]

How do we interpret: 3.35? –0.0031?
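One way to read the two numbers is to plug speeds into the regression equation. The sketch below assumes nothing beyond the equation on this slide; the function name `predicted_gpa` and the example speeds are made up for illustration.

```python
INTERCEPT = 3.35   # predicted GPA when Speed = 0
SLOPE = -0.0031    # change in predicted GPA per one-unit increase in Speed

def predicted_gpa(speed):
    """Regression equation from the slide: GPA = 3.35 - 0.0031 x Speed."""
    return INTERCEPT + SLOPE * speed

# Compare predictions at a few illustrative speeds.
for speed in (0, 100, 110):
    print(speed, round(predicted_gpa(speed), 3))
```

The intercept 3.35 is the predicted GPA at a speed of 0, and the slope says each additional unit of speed lowers the predicted GPA by 0.0031 (so 10 more units lowers it by 0.031).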


Correlation (r): Strength of linear association

[Figure, first panel: scatterplot of Weight versus Age, r = 0.877]

[Figure, second panel: scatterplot of Grade Point Average versus Fastest Speed Ever Driven, r = −0.149]


What if we change units? Say we measured weight in kg and speed in KPH

[Figure, first panel: scatterplot of Weight versus Age, r = 0.877]

New r (using kg) = 0.877

[Figure, second panel: scatterplot of Grade Point Average versus Fastest Speed Ever Driven, r = −0.149]

New r (using KPH) = −0.149


Facts about Correlation (review):

• We use the letter “r” to denote the correlation coefficient.
• The correlation coefficient is a measure of the strength of the linear relationship between the two variables in a scatterplot.
• The value of r must always be between −1 and 1:
    a. r=0 means no linear relationship.
    b. Positive r means the two variables tend to increase together (with r=1 meaning a perfect linear relationship).
    c. Negative r means that one variable increases while the other decreases (with −1 meaning a perfect linear relationship).

One more: r is unitless, and switching units does not change r.
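A quick check of the unit claim, as a sketch: the ages and weights below are made up for illustration (they are not the data behind the slides), and `pearson_r` is a hypothetical helper. Converting pounds to kilograms multiplies every weight by the same positive constant, which leaves r unchanged.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data, NOT the data plotted on the slides.
ages = [18, 22, 25, 30, 35, 40]
weights_lb = [120, 150, 140, 165, 180, 175]

LB_TO_KG = 0.45359237  # exact definition of the pound in kilograms
weights_kg = [w * LB_TO_KG for w in weights_lb]

r_lb = pearson_r(ages, weights_lb)
r_kg = pearson_r(ages, weights_kg)
print(math.isclose(r_lb, r_kg))  # same r in either unit
```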


[Figure: four scatterplots of Y versus X, with Correlation = .10, Correlation = .37, Correlation = .97, and Correlation = .70]


Strength of relationship is not the same as statistical significance

• Strength of linear relationship is measured by correlation coefficient, r.
• Statistical significance is measured as follows:

Assume that the truth is NO linear relationship. What proportion of randomly generated scatterplots would have a stronger linear relationship than the one observed?

ANSWER: p-value
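The "randomly generated scatterplots" idea can be sketched by simulation. The version below is only one possible reading, under stated assumptions: it creates "no relationship" scatterplots by shuffling the y values, and it counts plots whose |r| is at least as large as the observed one. The function names, the made-up data, the number of shuffles, and the seed are all illustrative choices.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Share of shuffled scatterplots whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any real link between x and y
        # Small tolerance guards against floating-point ties.
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    return hits / n_shuffles

# Made-up data: the same perfect line, once with 3 points and once with 10.
small_x = [1, 2, 3]
small_y = [2, 4, 6]
large_x = list(range(1, 11))
large_y = [2 * x for x in large_x]

p_small = permutation_p_value(small_x, small_y)
p_large = permutation_p_value(large_x, large_y)
print(p_small, p_large)
```

With only 3 points, 2 of the 6 possible orderings give a perfect line, so the simulated p-value lands near 1/3 even though r = 1. With 10 points, a shuffled plot almost never matches the observed strength, so the p-value is close to 0.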


Strength vs. statistical significance

• Even a weak relationship can be statistically significant (if it is based on a large sample)

• Even a strong relationship can be statistically insignificant (if it is based on a small sample)

Common rule of thumb: If p-value is smaller than 0.05 (five percent), then the result is considered statistically significant.
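A rough sketch of how sample size enters. It uses the standard textbook test statistic for a correlation, t = r·sqrt((n − 2)/(1 − r²)), which is an assumption brought in for illustration and does not appear in these slides. The two (r, n) pairs are made up, and the critical values are the usual two-sided 5% entries from a t table for 3 and 998 degrees of freedom.

```python
import math

def t_statistic(r, n):
    """Standard test statistic for a correlation r from n points (n - 2 df)."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Two-sided 5% critical values from a standard t table, keyed by n - 2.
CRITICAL_T = {3: 3.182, 998: 1.962}

def is_significant(r, n):
    """True when |t| exceeds the 5% critical value, i.e. p-value < 0.05."""
    return abs(t_statistic(r, n)) > CRITICAL_T[n - 2]

# Hypothetical cases, not taken from the slides.
print(is_significant(0.10, 1000))  # weak r, large sample  -> True
print(is_significant(0.80, 5))     # strong r, small sample -> False
```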


Correlation (r): Strength of linear association

[Figure, first panel: scatterplot of Weight versus Age. r = 0.877, p-value = 0.123: Not significant]

[Figure, second panel: scatterplot of Grade Point Average versus Fastest Speed Ever Driven. r = −0.149, p-value = 0.199: Not significant]


Average number of words a child knows at various ages

Imagine a scatter plot of the average number of words a child knows for ages about 1.5 to 6. The relationship is nearly linear and quite strong.


Note the problem of extrapolation here: At age 1, predicted size of vocabulary is –251! Extrapolation means trying to predict beyond the range of the explanatory variable. (Remember the Sept. 12 example of running times?)

[Figure: Regression Plot of Words known (vertical axis) versus Age (horizontal axis)]

Y = -806 + 555 X

S = 158.602   R-Sq = 97.1 %   R-Sq(adj) = 96.6 %

Correlation = .985
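The −251 figure comes straight from the fitted line. A minimal sketch: the function name `predicted_words` and the ages tried are illustrative, and the age range is the "about 1.5 to 6" stated on the earlier slide.

```python
def predicted_words(age):
    """Fitted line from the regression plot: Y = -806 + 555 X."""
    return -806 + 555 * age

AGE_RANGE = (1.5, 6)  # ages the data are said to cover ("about 1.5 to 6")

for age in (1, 2, 4, 6):
    inside = AGE_RANGE[0] <= age <= AGE_RANGE[1]
    note = "" if inside else "  (extrapolation)"
    print(age, predicted_words(age), note)
```

Inside the observed ages the predictions are sensible (for example, 304 words at age 2); at age 1, outside the range, the line gives the impossible value −251.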


Another problem: Sometimes we see a strong relationship in absurd examples. Two seemingly unrelated variables have a high correlation. This often signals the presence of a third variable that is highly correlated with the other two. (Confounding)


Vocabulary vs Shoe Size

[Figure: Regression Plot of Words known (vertical axis) versus Shoe Size (horizontal axis)]

Y = -806 + 555 X

S = 158.602   R-Sq = 97.1 %   R-Sq(adj) = 96.6 %

Correlation = .985


How can we have such high correlation between shoe size and vocabulary? (Note: These data were made up.) Easy: Both increase with age and hence age is a confounding (or hidden or lurking) variable. Age is positively correlated with both shoe size and with vocabulary.
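A small sketch of the same effect with numbers. All values below are hypothetical (like the slide's data, they are made up, and they are not the slide's own values): shoe size and vocabulary are each written down as roughly increasing with age, and nothing links them directly.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical children: each list position is one child.
ages = [2, 3, 4, 5, 6]
shoe_sizes = [6, 8, 9, 11, 12]               # grows with age
words_known = [300, 900, 1400, 2000, 2500]   # also grows with age

r_age_shoe = pearson_r(ages, shoe_sizes)
r_age_words = pearson_r(ages, words_known)
r_shoe_words = pearson_r(shoe_sizes, words_known)
print(round(r_age_shoe, 3), round(r_age_words, 3), round(r_shoe_words, 3))
```

All three correlations come out above 0.9: shoe size "predicts" vocabulary only because both track age.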


Outliers

Outliers are data that are not compatible with the bulk of the data. They show up in graphical displays as detached or stray points. Sometimes they indicate errors in data input. (This is quite a common occurrence!) Sometimes they are the most important data points.
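To see how much a single data-entry error can matter, here is a sketch with made-up speed and GPA values (not the class data). One GPA of 3.5 is mistyped as 35, and r is computed with and without the typo; `pearson_r` is a hypothetical helper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data, NOT the survey data from the slides.
speeds = [60, 75, 80, 90, 100, 110, 120]
gpas_clean = [3.6, 3.2, 3.5, 3.0, 3.3, 2.9, 3.1]
gpas_typo = [3.6, 3.2, 35, 3.0, 3.3, 2.9, 3.1]   # 3.5 entered as 35

r_clean = pearson_r(speeds, gpas_clean)
r_typo = pearson_r(speeds, gpas_typo)
print(round(r_clean, 2), round(r_typo, 2))
```

In this made-up example the single mistyped value weakens the correlation from about −0.72 to about −0.24.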


PA Election Fraud (case study 23.1, page 508)

Special election to fill state senate seat in 1993.
William Stinson (D) received 19,127 machine-counted votes.
Bruce Marks (R) received 19,691 machine votes.
Stinson got 1,391 absentee votes; Marks got 366.
Stinson wins by 461 votes.
Question: Is this an unusual number of absentee votes?
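The arithmetic behind the 461-vote margin can be checked directly from the counts above. The variable names below are illustrative.

```python
# Vote counts as stated in the case study.
stinson = {"machine": 19_127, "absentee": 1_391}
marks = {"machine": 19_691, "absentee": 366}

stinson_total = sum(stinson.values())   # 20,518
marks_total = sum(marks.values())       # 20,057
margin = stinson_total - marks_total    # 461

# Dem minus Rep differences, by type of vote.
machine_diff = stinson["machine"] - marks["machine"]      # -564
absentee_diff = stinson["absentee"] - marks["absentee"]   # 1025

print(margin, machine_diff, absentee_diff)
```

Marks led by 564 on the machines, but Stinson's 1,025-vote absentee advantage reversed the outcome.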


Dem minus Rep vote counts (so positive means D ahead), for absentee versus machine

[Figure: Regression Plot of absentee (vertical axis) versus machine (horizontal axis), showing the regression curve and 95% PI]

absentee = -182.575 + 0.295319 machine - 0.0000285 machine**2

S = 294.363   R-Sq = 62.0 %   R-Sq(adj) = 57.8 %
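As a sketch, the fitted curve can be applied to the 1993 election. This assumes the regression output reads absentee = -182.575 + 0.295319 machine - 0.0000285 machine**2 (the squared term is printed on a separate line of the plot's output), and it uses the Stinson and Marks counts from the previous slide. Dividing the gap by S is only a rough yardstick; the exact 95% PI limits are not given in the text.

```python
def predicted_absentee(machine_diff):
    """Fitted curve as read from the regression output (an assumed reading)."""
    return -182.575 + 0.295319 * machine_diff - 0.0000285 * machine_diff ** 2

S = 294.363  # typical size of a residual, from the regression output

machine_diff = 19_127 - 19_691   # Dem minus Rep on machines: -564
absentee_diff = 1_391 - 366      # Dem minus Rep absentee: 1025

predicted = predicted_absentee(machine_diff)
residual = absentee_diff - predicted
print(round(predicted, 1), round(residual / S, 1))
```

The curve predicts an absentee difference of about −358, while the observed difference was +1,025, a gap of roughly 4.7 times S.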